

FRANK S. FREEMAN 


PREFACE TO THE REVISED EDITION 


THIS revised edition, while essentially the same in form as the original 
one, has been improved and brought up to date where possible. The 
following special features might be noted. In this edition there is a 
fuller discussion of test standardization, particularly as regards meth- 
ods of estimating reliability and validity. Some of the more significant 
recently published tests have been included. The treatment of pro- 
jective techniques has been extended in such a manner as to be par- 
ticularly useful to students who are not specializing in clinical psychol- 
ogy. Also, the discussion of tests of specific aptitudes has been 
extended. Throughout, an effort was made to incorporate in discus- 
sions and evaluations the results of representative researches that have 
appeared since the publication of the first edition of this book. One 
other point in particular, should be noted here: namely, that con- 
siderably more attention and emphasis are given in this edition to 
psychological analysis of functions being tested by each of the several 
types of measuring devices. This aspect of the subject was not neg- 
lected in the first edition; but it has been enlarged in this revision. This 
is not to say that factorial analysis is disregarded; it signifies, however, 
that the value of such analysis rests basically upon the psychological 
insights of the test builder at the outset. 

Throughout this edition, more so than in the first, emphasis has 
been placed upon the necessity of interpreting test results in the light 
of the psychological principles involved, of the statistical bases in test 
construction, and of an understanding of developmental and be- 
havioral principles. It is the hope of many of us in this field of psy- 
chology that through such emphasis mechanical use and rule-of-thumb 
interpretation of tests will be discouraged; while, on the other hand, 
the importance of general competence in psychology is stressed. 




The major deletions are the chapters on “Statistics in Psychological 
Testing” and on “Applications and Problems” (Chapters 2 and 16, 
respectively, in the first edition). These chapters have been omitted to 
make room for essential additions and elaborations germane to the 
psychological tests themselves. Instructors who found Chapters 2 and
16 of the first edition useful will probably be able to refer their
students to that volume for these materials.


January, 1955 


F. S. F.
Ithaca, New York 


PREFACE TO THE FIRST EDITION 


AN EXAMINATION of the volumes by Guy M. Whipple, Manual 
of Mental and Physical Tests * will reveal the changes and develop- 
ments that have taken place in the field of psychological testing in the 
last forty years. At the same time, such an examination will also reveal 
the extent of the indebtedness of modern testing practices to the work 
of Whipple, his contemporaries, and his predecessors. When Whipple 
wrote his Manual, he did so in order to bring together, for the first 
time, a comprehensive and balanced description of psychological tests,
representing what he called the “simpler” and the “complex” proc- 
esses. 

In the developing scientific field of psychological testing, there is 
recurrent need for periodic presentation of comprehensive descrip- 
tions of tests. This volume is intended to meet such a need. At the 
same time, however, I have not limited this volume only to descrip- 
tions of these psychological instruments. 

Shortly after having begun, some years ago, to teach courses in 
psychological testing, individual differences, and clinical procedures, 
I was convinced that clinicians and other users and interpreters of 
test results must have an understanding of the theoretical principles 
and assumptions upon which tests are constructed. For that reason I 
have at several points in this volume presented basic theories and 
principles, independent of any specific test or group of tests; theories 
and principles which are common to a wide range of tests. In addi- 
tion, other theoretical principles, assumptions, and problems have 
been presented, where relevant, in conjunction with the descriptions 
and evaluations of particular tests or with several belonging to the 


* Originally published in 1910. Revised and published in two volumes in 1914 
and 1915 (Baltimore, Warwick and York). 




same category of instruments. The mere discussion of basic theories, 
entirely separate and distinct from descriptions and evaluations of 
specific testing devices, is a relatively barren procedure for anyone 
except those who are already experienced and qualified in the sub- 
ject. It is for this reason that I have described many representative 
tests in detail, both as to psychological and statistical aspects. The 
student will thus have specific substance to project against and compare
with theories previously presented. Knowledge of how to interpret
tests is best achieved through combined understanding of theory
and familiarity with test content.

Probably no two authors would be in complete agreement as to
tests to be included, though their selections would have many in
common. I believe, however, that the devices included in this book
are representative of the sounder instruments currently available;
though in a few instances poor tests have been mentioned or described
for the purpose of illustrating a relevant aspect of the subject. The
tests have been so grouped, it is hoped, as to prove most useful to
those interested in particular types or levels.

Some historical and developmental background of the subject is
provided, especially the work of Alfred Binet; for I believe the student
achieves a fuller and sounder appreciation of the present status of the
science and its applications through a presentation of early work and
thinking. For the advanced student, however, the historical and
developmental background provided herein should not suffice; he should
consult more detailed and comprehensive studies. *

This volume is intended primarily for students who plan to enter
professions in which psychological tests are administered and the
results interpreted in dealing with adjustment and numerous other
psychological problems. Thus it is designed for the use of clinical
psychologists, school psychologists (who are also clinicians), guidance
counselors, teachers, psychiatrists, pediatricians, social workers,
and personnel officers. Without at all minimizing the usefulness of
group testing and group studies of psychoeducational, psychosocial
and purely psychological problems, the emphasis herein is on individual
and clinical interpretation of test findings. While recognizing
[that the individual does not live in] a social vacuum and that his


* E. g., Jos. Peterson, Early Conceptions and Tests of Intelligence, 1925
(Yonkers, N. Y., World Book Company); E. J. Varon, “The Development of
Alfred Binet’s Psychology,” Psychological Monographs, 1935, 46, no. 207.




behavior and performances can be fully understood only in terms 
of himself and of his environments, the fact is, nevertheless, that in 
psychological testing the ultimate unit of concern to the examiner 
usually is the individual subject. After the results of testing are avail- 
able, in an individual instance, the psychologist may then seek causal 
and explanatory factors. 

While formal preparation in statistics, especially in statistical rea- 
soning, is desirable, even in the case of students beginning the study 
of psychological testing, it is not possible to provide that prepara- 
tion in a book such as this. But understanding of the more common 
statistical indexes, methods, and reasoning is necessary. Hence, a
chapter on “Statistics in Psychological Testing” has been included for 
those students who have not had formal preparation. I am pleased to 
acknowledge my indebtedness to my colleague, Professor T. A. Ryan, 
who collaborated by assuming major responsibility for this chapter 
which is within the area of his major teaching interests. 

Psychologists will note, of course, that a large and important seg- 
ment of psychometric methods and devices has been almost entirely 
omitted from this book; that is, the psychophysical methods of Weber, 
Fechner, Müller, Urban, and their successors. The reason for the
omission is that psychophysical measurements comprise an area in 
themselves and are beyond the scope and purpose of this volume. 

Any author of a textbook is indebted, of course, to the many scien- 
tists and scholars who have preceded him and contributed the mate- 
rials from which the book is developed. In specific instances of 
indebtedness, I have acknowledged authors and sources at each point, 
where the documentation is most useful. In particular, I wish to thank 
the authors and their publishers who kindly gave permission to reproduce
textual, tabular, and graphic materials.

One or more chapters were read in manuscript by Dr. Solomon 
Machover, Professor Max L. Hutt, and Professor Frederick L. 
Marcuse. I wish to acknowledge my appreciation of their valuable 
criticisms and suggestions. 


F. S. F.
Cornell University 
November 17, 1949 


CONTENTS


CHAPTER


1. BASIC THEORETICAL PRINCIPLES
Objectivity in Administering and Scoring. A Representative
Population Sample. Sampling of Traits and Functions. Steps in the
Development of a Test. Reliability. Validity.


2. INTERPRETATION OF TEST SCORES: QUANTITATIVE AND QUALITATIVE
An Index of Relative Rank. Psychological Measurement Contrasted
with Physical Measurement. Clinical Aspects. Difference between
Norms and Standards. Factors in Selecting a Test.


3. DEFINITIONS AND ANALYSES OF INTELLIGENCE
Definitions of Intelligence. Two Comprehensive Definitions.
Implications for Test Design and Content. Three “Kinds” of
Intelligence. Analyses of Mental Ability. Factor Analysis.
Illustrations of Factors. Implications.


4. THE BINET SCALES
Historical Background. The Early Work of Alfred Binet. The 1905
Binet-Simon Scale. The 1908 Binet-Simon Scale. The 1911 Revision
of the Binet Scale. Summary.


5. EARLY REVISIONS OF THE BINET-SIMON SCALE
Four Early Revisions. The Stanford Revision of 1916.


6. THE 1937 REVISION OF THE STANFORD-BINET SCALE
Description of the 1937 Scale. Validation. Reliability.
Determining Mental Age and Intelligence Quotient. Distribution of
IQ’s. Suggested Classification of Revised Stanford-Binet IQ’s.
Analysis of Functions Tested. Types of Items. The Short Scale.
Evaluations and Criticisms.


7. THE WECHSLER SCALES
Description of the Wechsler-Bellevue Intelligence Test. Functions
Involved in the Subtests. Need for an Adult Scale. Standardization.
The Population Sample. Validity. Reliability. Scoring and IQ
Calculation. Special Features of the Bellevue Scale. Criticisms and
Evaluations. The 1955 Revision of the Bellevue Scale. The Wechsler
Intelligence Scale for Children (1949).


8. INDIVIDUAL PERFORMANCE SCALES
The Pintner-Paterson Scale of Performance Tests. The Cornell-Coxe
Performance Ability Scale. The Arthur Point Scale of Performance
Tests. Revised Arthur Scale: Form II. Other Performance Tests.
Functions Tested by Performance Scales. Evaluation of Performance
Tests.


9. SCALES FOR INFANTS AND PRESCHOOL CHILDREN
Gesell Developmental Schedules. Minnesota Preschool Scale. Cattell
Developmental and Intelligence Scale. Merrill-Palmer Scale of
Mental Tests. Evaluation of Scales for Infants and Preschool
Children.


10. NONVERBAL GROUP SCALES OF MENTAL ABILITY
Beginnings. Characteristics of Group Tests of Mental Ability.
Pintner-Cunningham Primary Test. Chicago Nonverbal Examination.
Revised Army Beta Examination. Pintner Nonlanguage Series:
Intermediate Test. Nonlanguage Multi-Mental Test. Pattern
Perception Test. Progressive Matrices Test. Cattell Culture-Free
Test. Goodenough Drawing Test. Davis-Eells Test of General
Intelligence. Evaluation of Nonverbal Group Scales.


11. VERBAL AND MIXED GROUP SCALES OF MENTAL ABILITY
California Tests of Mental Maturity. Terman-McNemar Test of Mental
Ability. Tests of Primary Mental Abilities. Kuhlmann-Anderson Tests
(6th Edition). Group Scales for College Freshmen. Army General
Classification Test. Miller Analogies Test. Other Group Scales.
Evaluation of Group Scales. Uses of Group Scales.


12. APTITUDE TESTS: MECHANICAL AND CLERICAL
Definition and Explanation. Tests of Vision and Hearing. Motor and
Manual Tests. Tests of Mechanical Aptitude. Tests of Clerical
Aptitude. Differential Aptitude Tests. Aptitude Classification
Tests.


13. APTITUDE TESTS: FINE ARTS AND PROFESSIONS
Tests of Musical Aptitude. Tests of Aptitude in the Graphic Arts.
Tests of Aptitude in Medicine. Tests of Aptitude in Law. Tests of
Aptitude for Teaching. Tests of Science and Engineering Aptitudes.
Interest Inventories. General Evaluation of Aptitude Tests. Steps
in an Aptitude Testing Program.


14. TESTS OF EDUCATIONAL ACHIEVEMENT
Scope. Purposes. Derived Indexes. Types of Items. Three
Representative Batteries. Reading Tests. Arithmetic Tests. Tests at
High School and College Levels. Tests of Aptitude in Specific
Academic Subjects. Tests of More Complex Educational Objectives.
Evaluation of Achievement Tests. Tests of Proficiency.


15. INTELLIGENCE TESTS AS CLINICAL INSTRUMENTS
Factors Which Affect Test Performance. The Stanford-Binet Scale.
The Bellevue Scale. Kent Series of Emergency Scales. A Report
Outline.


16. TESTS OF MENTAL IMPAIRMENT
The Babcock Test. Tests of Concept Formation. The Hunt-Minnesota
Test for Organic Brain Damage. The Bender Visual-Motor Gestalt
Test. Evaluation of Tests of Impairment.


17. PERSONALITY RATING SCALES
Definition of Personality. Rating Scales: Major Aspects.
Representative Rating Scales. Evaluation of Rating Scales.


18. PERSONALITY INVENTORIES
Purposes and Types of Inventories. Representative Inventories.
Biographical Data Questionnaires. Tests of Attitudes and Values.
Opinion Polling. Evaluation of Personality Inventories.


19. PROJECTIVE METHODS: THE RORSCHACH AND THE THEMATIC
APPERCEPTION TESTS
Definition and Explanation. The Rorschach Test. Thematic
Apperception Test.


20. PROJECTIVE METHODS: VARIOUS
Word Association Tests. Picture Tests. Verbal Completion Tests.
Drawing and Painting. Play. Evaluation of Projective Tests.


21. SITUATIONAL TESTS
Sociometric Methods. Tests of “Social Intelligence” and Leadership.
Psychodrama. Office of Strategic Services: Assessment Tests.
Evaluation of Situational Tests.


INDEX


1.


BASIC THEORETICAL PRINCIPLES 


ALTHOUGH tests of general intelligence, specific aptitudes, person- 
ality, and educational achievement are designed and constructed for 
different purposes, all of them have certain principles and procedures
in common; and any combination of these categories of tests might be 
used in dealing with a specific individual case or in attempting to solve 
a particular psychological problem. Psychological tests have been used 
to find answers to a number of psychological questions, both theoreti- 
cal and practical. But ultimately, and most important, they are in- 
tended to contribute to the analysis and description of individuals, and 
to the evaluation, prediction, and guidance of their behavior and edu- 
cation. The following are the aspects that are common to the several 
types of psychological instruments and that give them their objectiv- 
ity: objectivity in administering and scoring; norms based upon a 
population sampling, scientifically selected, for a particular test; sam- 
pling of specified traits or functions, by means of a particular test; 
incorporation, within a test, of a composite of views of a number of 
experts; utilization of recognized techniques of test standardization. 
The tests’ objectivity and the standardization process give them a sci- 
entific quality which, of course, is absent in an individual's personal 
estimate of psychological traits and functions. We define a psychological
test as a standardized instrument designed to measure objectively
one or more aspects of a total personality, by means of samples of
performance or behavior.




OBJECTIVITY IN ADMINISTERING AND SCORING 


Each psychological test is administered under a prescribed set 
of procedures. These procedures involve preparation of the persons to 
be tested by means of introductory and explanatory remarks; phrasing 
and formulation of instructions whereby each part, or each item in 
some instances, is to be presented; setting time limits, if any; decisions 
as to when to repeat instructions or to offer encouragement, and when 
not to do so, as well as when to answer questions asked by a subject, 
and when not; the use of practice exercises, if any. 

The scores and ratings thus derived from the tests are not depend- 
ent upon the individual bias or judgment of the particular examiner. 
For the score of any subject on an objective test is arrived at by the 
use of a scoring key; or the scoring is otherwise so clearly defined, 
specified, and illustrated that subjective judgments of individual ex- 
aminers or scorers do not enter in at all or are reduced to a minimum. 
Thus an objective test provides a highly uniform means of evaluating 
the psychological traits or functions being measured; results obtained
by one competent examiner are comparable with those obtained by
others.
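
The role of a scoring key can be shown with a minimal sketch in Python. The item names, the keyed answers, and the function `score_responses` are all hypothetical, invented here for illustration; they are not drawn from any published test. The point is only that the score follows mechanically from the key, so that any two scorers arrive at the same result.

```python
# Hypothetical answer key, invented for illustration only.
SCORING_KEY = {"item_1": "b", "item_2": "d", "item_3": "a", "item_4": "c"}


def score_responses(responses, key=SCORING_KEY):
    """Return the number of responses that agree with the scoring key.

    An omitted item is counted as incorrect. The scorer's own opinion
    of the subject never enters the computation.
    """
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


# One subject's answers; item_4 was left blank.
subject_responses = {"item_1": "b", "item_2": "a", "item_3": "a"}
print(score_responses(subject_responses))  # 2
```

Tests whose responses cannot be reduced to a fixed key (for example, orally given definitions) would need scoring rules and illustrations rather than a simple lookup of this kind.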


A REPRESENTATIVE POPULATION SAMPLE 


Every test is designed and intended for use with a specified 
population, or group. For example, a test of intelligence may be 
standardized for use with individuals from the age of two years 
through adulthood (Stanford-Binet, 1937 revision); another for ages 
eleven to seventeen (Chicago test of Primary Mental Abilities); another
primarily for adults (Wechsler-Bellevue). Still others cover different
age ranges.

A test of scholastic achievement in a particular school subject or 
group of subjects may be intended for the first three grades; or for 
grades eight through twelve; or for college freshmen; or for other 
grade ranges, depending upon the school subject and the prescribed 
scope. 

Tests of specific aptitudes likewise are designed for specified pop- 
ulations. For example, one test of ability in art is designed for grades 
seven and above; one test of mechanical aptitude is to be used for ages 
eight to twenty-one; a law aptitude test is standardized, of course, for 




college students and others who are candidates for admission to a law 
school, regardless of age. 

Rating scales, personality inventories, and projective tests are like- 
wise intended for use with a specified segment of the total population. 
They may be designed for a selected age group; for particular occupa- 
tions; for given educational levels; for one sex of limited age range; 
for the diagnosis of clinical cases, or for use with non-clinical popu- 
lations as well. 

In any event, whatever the traits or functions to be measured, what- 
ever the range of ages or school grades, and whether for clinical or 
non-clinical groups, the test must be standardized upon a group that 
is a representative sample of the total population for which it is in- 
tended. Each test must be constructed by means of actually sampling 
the performance of an adequate group which has been selected in such 
a way as to insure its being typical of the population of which it is a 
part. 

Factors to be taken into account in making a population sampling 
will depend upon the nature and the comprehensiveness of the test 
under construction. In any instance, the sample should yield unbiased
data on the population of which it purports to be representative; and 
the sample should be large enough to provide statistically valid results 
for the traits or functions being measured by the test. 

This means, of course, that the author of a test must decide at the 
outset with which group, with what segment of the population his 
instrument is to be used. Then he must standardize his test upon a 
population sample that is stratified according to relevant factors; and 
within each stratum the selection of cases should be adequate in number
and of correct proportion in the total.¹

For example, if a psychologist is to construct a test of “general in- 
telligence” for American children, in the primary grades, ranging in 
age from five to nine years, he will have to take into account the fol- 
lowing factors in obtaining his standardization population: age, sex, 
geographic area, parental occupational level, and type of community 
(urban, village, farm). The author of the test must decide, also, 
whether he will standardize his test entirely on a Caucasian population,
or whether he will include non-Caucasian elements. If it is to be


1 On sampling, see M. B. Parten, Surveys, Polls, and Samples, New York: 
Harper, 1950. 




the latter, then the racial factor must be taken into account in the 
stratification of the sample. 
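
Stratification "of correct proportion in the total" may be sketched as follows, in Python. This is an illustration under stated assumptions, not a procedure taken from any test manual: only one factor (type of community) is used, the population figures are hypothetical, and the function names `proportional_allocation` and `draw_stratified_sample` are invented here. Fractional cases are settled by a largest-remainder rule, which is one convenient choice among several.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Divide sample_size among strata in proportion to their population sizes.

    A largest-remainder rule settles fractions so the counts sum exactly.
    """
    total = sum(stratum_sizes.values())
    exact = {s: sample_size * size / total for s, size in stratum_sizes.items()}
    counts = {s: int(x) for s, x in exact.items()}  # whole cases first
    leftover = sample_size - sum(counts.values())
    # Give the remaining cases to the strata with the largest fractions.
    by_fraction = sorted(exact, key=lambda s: exact[s] - counts[s], reverse=True)
    for s in by_fraction[:leftover]:
        counts[s] += 1
    return counts


def draw_stratified_sample(frame, stratum_of, counts, seed=0):
    """Draw the allotted number of cases at random within each stratum."""
    rng = random.Random(seed)
    sample = []
    for stratum, n in counts.items():
        members = [case for case in frame if stratum_of(case) == stratum]
        sample.extend(rng.sample(members, n))  # ValueError if stratum too small
    return sample


# Hypothetical population of 10,000 children by type of community.
population = {"urban": 5600, "village": 2000, "farm": 2400}
counts = proportional_allocation(population, 500)
print(counts)  # {'urban': 280, 'village': 100, 'farm': 120}

frame = [{"id": f"{c}-{i}", "community": c}
         for c, n in population.items() for i in range(n)]
sample = draw_stratified_sample(frame, lambda case: case["community"], counts)
print(len(sample))  # 500
```

In practice the strata would be formed by crossing several factors (age, sex, geographic area, parental occupational level, type of community), and the same proportional rule would be applied to each resulting cell.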

Since individuals within a representative sample of children of any 
given age vary widely in respect to mental abilities, some will reach 
only the levels of younger age groups while others will attain the levels 
of older age groups. Thus, to ascertain the developmental level of the 
retarded it is necessary to extend downward the chronological age of 
the standardization sample; and, conversely, it is necessary to extend 
upward the age limit for the superior. 

The validity of results obtained with any psychological test will be
dependent, in part, upon the adequacy and representativeness of the
standardization population.


SAMPLING OF TRAITS AND FUNCTIONS 


Any given test measures a limited aspect of the person being 
examined, though some tests are much more restricted in scope than 
others. It is essential, therefore, that the test builder define the aspect, 
or aspects, he proposes to measure. After doing this, he must develop 
a series of test items that will best sample the traits or functions with 
which his test is concerned. 

In developing a psychological test, it is impossible, and in fact un- 
necessary, to use an unlimited number of items. It is not necessary to 
attempt to present the individual being tested (called the “subject” or 
the “testee”) with problems that will ascertain his responses for every 
conceivable situation involving a given trait or function. It is sufficient 
to get an adequate sampling of responses in a particular area or range 
of behavior, the assumption being that the sampling is representative 
of the whole. 

Two kinds of sampling are actually involved in constructing a psy- 
chological test. First, the most relevant constituents of the gross varia- 
ble (the broad, comprehensive trait or function) must be selected. 
Where, for example, the gross variable is “general intelligence,” the 
constituent parts in the test might be: vocabulary, verbal comprehen- 
sion, arithmetical problems, reasoning with practical problems, verbal 
and other analogies, perceptual organization, and so forth. Second, the 
operational levels (that is, the actual items) must be selected: which 
arithmetical processes and at what levels, what kinds and which levels 
of words, what types and range of situations, which perceptual figures? 

In following this procedure, psychologists are employing a well-known
and widespread technique. If a chemist wishes to determine
the quality of a shipment of milk, he takes small quantities here and 
there, combines these, and then analyzes a sample of the samples. If
an agronomist wishes to analyze a given area of soil, he gathers small 
amounts from various spots. If a blood test is to be made, a very small 
quantity, taken from one place, is sufficient and representative of the 
entire stream. Numerous other illustrations can be found. So, too, with 
intelligence, specific aptitudes, and school achievement. It has been 
said that psychological testing may be thought of, figuratively, as sink- 
ing shafts here and there within a given range in order to ascertain 
depth and quality. 

Specifically, for present purposes of illustration, intelligence may be 
defined in several ways: (1) capacity to integrate experiences and to 
meet a new situation by means of appropriate and adaptive responses; 
(2) capacity to learn; (3) capacity to carry on abstract thinking.²
While psychologists differ in regard to which of these three aspects is
most important and which they would emphasize, the fact is that most
tests of general intelligence probe and sample all three. The following
types of items found in various current tests fall under one or more
of these definitions, and are constituent parts of the gross variable,
general intelligence.


Practical reasoning. “What’s the thing for you to do when you
have broken something which belongs to someone else?” (From the
Revised Stanford-Binet Scale, Form L.)

Definitions of words: i.e., concept formation.

Perceiving similarities and differences between objects: For example:
“In what way are wood and coal alike?” “In what way are
a baseball and an orange alike and in what way are they different?”
(From Revised Stanford-Binet Scale, Form L.): i.e., abstraction and
generalization

General information tests: i.e., assimilation and retention of
experiences


2 Intelligence will be defined and described in a later chapter. There are
some psychologists who prefer to discard the term intelligence because they
believe it is not a function in itself but rather an aggregate of particular
aptitudes or, to use a more recent term, of “primary mental abilities.” Like many
other psychologists, we continue to use the term intelligence, which we prefer
for two reasons: (1) It has general and meaningful currency now, especially
in connection with tests. (2) Even if certain abilities were “primary,”
intelligence would not be a mere aggregation of these, but rather an integration, in
which case intelligence is really something new, different from, and more than
its several constituent parts.




Arithmetical reasoning: i.e., reasoning with abstractions

Supplying missing parts to pictures: i.e., perceptual integration

Reproducing geometric figures from memory: i.e., visual imagery
and organization


Arranging a series of pictures in logical sequence: i.e., visual
perception and reasoning


Perception of color and form design: i.e., visual imagery and re- 
call, analysis and organization 


Explanation of absurdities in given pictures: i.e., analysis of visual 
percepts 


Oral solution of practical problems orally presented: i.e., analysis 
and generalization 


Solving problems involving distances and directions (without use 
of paper and pencil): i.e., spatial orientation 


Deriving and giving the meanings from a prose passage: i.e., rea- 
soning with abstractions 


Another method of determining the component parts of the gross
variable, which in this case we call general intelligence, is through
“factor analysis,” to be discussed more fully in Chapter 3. According
to one analysis, there are six such components, relatively independent
of one another:³ facility with numbers (the four fundamental processes),
vocabulary (word meaning), space perception (perceiving
similarities of and differences between geometric figures), word fluency
(controlled word association), reasoning (insight into patterns of
letters arranged in series), and memory (immediate recall of discrete
verbal materials). Tests have been constructed on the basis of this
analysis, items having been devised for each of the six categories.
At present, also, there is a trend among some psychologists toward
analysis of the gross variable’s component parts into subdivisions; that
is, into elements of the component parts. For example, the component
“reasoning” has been tentatively analyzed into the following four
aspects:⁴


Reasoning I
a. manipulating symbols
b. solving problems
c. defining problems
d. testing hypotheses


3 L. L. Thurstone and T. G. Thurstone, The Chicago Test of Primary Mental
Abilities, Chicago: Science Research Associates, 1943.

4 J. P. Guilford, et al., A Factor-Analytic Study of Reasoning Abilities, Los
Angeles: University of Southern California, Report Number 1, June 1950.




Reasoning II
a. seeing rules or principles (induction)
b. seeing systems
c. seeing trends
d. seeing relations (educing relations)
e. seeing identity of relationships
f. analyzing forms
Reasoning III
a. seeing common elements or properties
b. classifying (in general)
c. classifying forms
d. educing correlates
Reasoning IV
a. drawing inferences (deduction)
b. syllogistic reasoning

Inspection of these four types of reasoning reveals that they are
neither mutually exclusive nor independent of one another. Yet if
these and their parts are sufficiently distinct and constitute reasoning
in its several aspects, and if reasoning were to be measured according
to this scheme, it would be necessary to devise items for each of the
four types and for each sub-type.

Specific aptitude, as another example, may be defined as a capacity
that indicates the probable degree of successful learning and achievement
in a particular and limited type of activity, for example, musical,
mechanical, artistic, or linguistic aptitude. A test intended to estimate
a person’s capacity in each of these must include parts and items
sufficient in number and extensive enough in scope to provide an adequate
sampling upon which a prediction of subsequent learning and
achievement may be based.

The constituent parts of a test of “mechanical aptitude,” for instance,
might be: knowledge of tools and mechanical devices, skill
in assembling parts of a mechanism, perception of spatial relations,
manual or digital dexterity, or others found through statistical and
psychological analysis to have predictive and selective value.

Personality tests and inventories must also be based upon samplings 
of the constituent traits that the test author proposes to evaluate. This 
is true even though personality itself is most difficult to define and 
though its components are elusive. Thus we have inventories that at- 
tempt to measure, among others, degrees of introversion-extroversion, 
neurotic tendencies, anxiety, hypochondriasis, adjustment to home, 
adjustment to school, and dominance-submission. Each author of a 




personality inventory or of a projective test, in order most adequately 
to fulfill his purposes, must determine which aspects of personality 
are to be examined by means of his instrument. 

An educational achievement test is designed to measure an individ- 
ual’s information in, or skill with, or understanding of—or all three of 
—a given subject of study taught in school; for example, reading 
rate and comprehension, arithmetical processes and problem solving, 
American history, English usage, and so on. In each instance, as in 
all other types, the scope of the test must be defined, the parts of the 
gross variable must be determined, and the elements of each part must 
be represented. Educational achievement tests depend for their va- 
lidity upon the adequacy with which they sample the subject-matter 
field for which they are intended. 

Tests of general intelligence, of specific aptitudes, and of educa- 
tional achievement are intended, of course, to indicate the person’s 
status, at the time of examination, in their respective areas; but they 
are intended, also, for other purposes. Intelligence tests are employed 
to predict an individual's probable future level of mental development 
and capacity. Tests of specific aptitude are used to predict probable
future learning of and performance in a particular activity or occupa- 
tion. Results of educational achievement tests are helpful in forecast- 
ing the subject’s probable future level and quality of learning in the 
several school areas and in diagnosing specific difficulties and disabili- 
ties in basic school subjects. The great importance, therefore, of ade- 
quately conceived and satisfactorily standardized instruments is readily 
apparent. 

In respect to determination of an individual’s present status, person- 
ality inventories and projective tests are like the three other types 
mentioned above. And while personality tests are used, to some ex- 
tent, for predicting future behavior and adjustment, their greater sig- 
nificance and usefulness, in relation to one’s future status, lies in the 
fact that their results are very valuable in diagnosing personality prob- 
lems and as a basis for psychological counseling, or therapy where 
indicated. 


STEPS IN THE DEVELOPMENT OF A TEST 


In devising a test it is necessary, as already explained, first to 
define that which is to be measured. Tests of intelligence, specific aptitude,
school achievement, and personality have already been briefly
defined; and illustrative materials will be presented in later chapters.
At this point, several illustrations will suffice to demonstrate how psy- 
chological and statistical analysis determine the form and content of 
a test. 

Alfred Binet, the French psychologist of whose work much more 
will be said later, proposed that the following be tested as components 
of intelligence: memory, mental images, imagination, attention, com- 
prehension, suggestibility, esthetic appreciation, muscular strength, 
strength of will, motor skill, and visual judgment. Some of these sur- 
vived his own experimental investigations and those of other psycholo- 
gists, while others were rejected as being invalid. Binet suggested the 
measurement of these processes in the first place because he believed
that they differ sufficiently from person to person and that knowledge
of their level of development in different persons would permit the
psychologist to distinguish one person’s general mental development
from that of others. Here we have the beginnings of the tests which
were to prove so useful in the construction of his and other scales.

Subsequently, psychologists were to identify other and more specific 
processes which, they held, should be tested. Thus Spearman developed
the theory that intelligence is essentially a generalized function
(g) and should be measured through a broad sampling of mental
activity. E. L. Thorndike, on the other hand, maintained that intelligence
consists of a multitude of highly specific processes (unnamed)
and that the validity of a test of intelligence will depend upon sampling
of these with psychological insight. L. L. Thurstone, among
others, developed still a different theory: namely, that what we call
“intelligence” is made up of a number of “primary mental abilities,”
each of which must be sampled, with a “secondary” general factor involved.
These theories and other problems concerning the nature of
intelligence will be dealt with in Chapter 3; for the present we are concerned
only with definitions and analysis of processes as necessary
steps in the development of tests.

The Seashore tests in music include the following processes: pitch 
discrimination, judgment of tone intensity, perception of time, dis- 
crimination of tonal timbre, tonal memory, and rhythm discrimination. 

In testing knowledge of and skill in language, any or all of the fol- 
lowing might be included: grammar, usage, punctuation, vocabulary, 
reading rate, reading comprehension, visual acuity, eye movements,
and auditory acuity.


Regardless of the particular definition of intelligence that commends 
itself most to a psychologist, and regardless of what analysis into par- 
ticular processes appears to have greatest value, any psychological test 
must measure the trait and function in its manifestations, in one form 
or another. The task then is to devise and select items conforming to 
the definition of the function or trait and to the analysis made of it. 

After the original group of items has been devised and selected, 
they are given a series of try-outs on the groups for which they are 
intended. The results thus obtained are subjected to established statis- 
tical analyses and to scrutiny on the basis of a series of criteria of 
validity and reliability. As a result of this first analysis and scrutiny, 
some items are rejected, some are retained, and new ones are added. 
The new and more highly selected items are again put through the 
same process of validation, resulting in further improvements and 
refinements of the test under construction. This will be repeated sev- 
eral times before the finished scale emerges. The entire process of 
try-outs, statistical analysis and evaluation on the basis of accepted 
criteria is called test standardization. Standardization of tests is a topic
which will recur with some frequency in this volume. For the present,
however, it will suffice to indicate what are the most important aspects
involved in the process, in addition to those already discussed.


RELIABILITY


The two essential characteristics of any sound test are relia- 
bility and validity. 

The term reliability has two closely related but somewhat different 
connotations in psychological testing. First, it refers to the extent to 
which a test is internally consistent; that is, the extent to which the 
test scores are subject to or free from such internal defects as will 
produce errors of measurement due to the quality of the items rather 
than to the instability of performances of testees themselves. In other 
words, how accurately is the test measuring at a particular time? Sec- 
ond, reliability refers to the extent to which an instrument yields con- 
sistent results on testing and retesting. That is, how dependable is it 
for predictive purposes? Obviously, if a test does not have a high de- 
gree of reliability, it can have but limited value, if any, in predicting 
an individual’s future performance or level of development. It is clear 
that these two aspects of reliability are intimately related; if a test is
not highly reliable on any particular occasion, it can have little predictive
value.

Since one of the principal uses of psychological tests is to predict 
and plan for subsequent development and performance, a high degree 
of reliability is a sine qua non of a sound instrument. Reliability is not, 
however, an all-or-none proposition; it is a matter of degree. No test 
is perfectly reliable; the scores obtained on repeated testings are not 
completely stable, either in terms of internal consistency or of
prediction. There are always some errors of measurement, large or small;
and it is “normal” for humans to vary in performance, generally within 
fairly narrow limits, from one occasion to another—to vary, that is, 
aside from the expected changes that occur as part of the process of 
growth and development. 

Differences among the test scores of individuals in a group are due 
to: (1) “true,” or actual, differences in the trait being measured 
within those persons being examined, and (2) sources of inaccuracy
in the measurement of individuals. These sources of inaccuracy may
be inherent defects of the test itself, conditions or “chance” factors
operating at the time of testing, or unpredictable fluctuations in the
performance of the subjects. Standardization aims to eliminate or reduce
inherent defects of the test; the conditions of testing and retesting
should be as nearly consistent and optimal as possible; and though
minor fluctuations in an individual’s performance from day to day or
week to week cannot be controlled, the reasons for any major fluctuations
must be sought in the individual himself or in some of the environmental
forces, if the sources of inaccuracy are to be understood.

The possible sources of variation in performances on a test are
many. This aspect of testing will recur frequently in our subsequent
discussions; but for the present, the most common of them may be
listed as follows:

1. Actual, or “true,” differences among individuals in the general
traits or general abilities being measured.

2. Specific abilities required in a particular test; or specific disabilities
in the functions being tested.

3. Skill in taking tests; being “test wise,” or the converse.

4. The “chance” acquisition of a particular piece of knowledge or
information required in a test: e.g., the meaning of an unusual word
such as ambergris, or a bit of unusual information such as the name
of the author of a little-known work. (These would be poor test
items.)

5. Effects of practice (previous test-taking) or, in some instances,
coaching.

6. Normal or expected fluctuations in performance from time to
time.

7. Personal characteristics of the testee: motivation, health, energy
level, emotional status.

8. Physical conditions under which the test is taken: heat, light,
ventilation.

9. Unpredictable, or “chance” factors: noise, interference, broken
pencil, misunderstanding of instructions, etc.

10. Fortunate guessing of answers.


Test results, ideally, should depend upon the extent to which the 
test measures the first two of these sources of variation; actually, how- 
ever, the coefficients of reliability will be adversely affected by the 
nonsystematic operations of the others. 

There are two methods, in general, of expressing the consistency, or 
dependability, of test results: (1) absolute reliability, and (2) relative 
reliability. The first of these is usually stated in terms of the standard 
error of measurement. The second is given, though infrequently, in
terms of analysis of variance, or, much more commonly, as a correla- 
tion coefficient indicating the degree to which individuals maintain 
relatively consistent positions in their group when a single test is ad- 
ministered twice, or when two equivalent forms of a test are applied 
to all members of a group. This correlation between the two sets of 
scores is known as the coefficient of reliability.
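As a rough illustration of the coefficient of reliability just described, the following sketch correlates two sets of scores obtained from the same group. The scores are invented for the purpose, and the helper name `pearson_r` is an assumption of this example rather than anything taken from a test manual.

```python
from math import sqrt


def pearson_r(first, second):
    """Product-moment correlation between two paired lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least two scores")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    # Sums of squared deviations and of cross products of deviations.
    ss_1 = sum((x - mean_1) ** 2 for x in first)
    ss_2 = sum((y - mean_2) ** 2 for y in second)
    sp = sum((x - mean_1) * (y - mean_2) for x, y in zip(first, second))
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("correlation is undefined when one set has no spread")
    return sp / sqrt(ss_1 * ss_2)


# Hypothetical scores of five persons on a first and a second testing.
first_testing = [10, 12, 15, 18, 20]
second_testing = [11, 12, 16, 17, 21]

reliability = pearson_r(first_testing, second_testing)
print(f"coefficient of reliability: {reliability:.2f}")
```

The same function serves whether the second set of scores comes from a retest with the identical form or from an equivalent form; only the source of the second list changes.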


Test-retest reliability. When persons are tested and retested a num- 
ber of times, they may undergo some change as a result of repeated 
measurements: e.g., in the form of practice effects, improvement in 
the skill of taking tests, and in the “set” or attitude toward a test. In 
estimating reliability, therefore, it is necessary to limit the number of 
times an individual is tested with the same device. Hence, instead of 
frequent retesting of the same persons, dependable results for a given 
psychological instrument are obtained by increasing the number of 
persons tested rather than by increasing the number of measures of
each person. Therefore, techniques have been devised for evaluating
the results obtained with only one or two measurements of the same
individuals, namely:

1. Two equivalent forms of the test are administered and the two
sets of scores are correlated.

2. A single form of the test is administered twice and the two sets of
scores are correlated.

3. The items of a single test are subdivided into two separately scored
groups, the two sets of scores being correlated as though they were
obtained from two forms or two testings.

Administering two equivalent tests has several disadvantages. The 
procedure requires more time, of course. The two forms might vary
somewhat in content, thus underestimating reliability of either form.
The experience of having taken the first test might result in some
learning or improved skills. If two forms of a test are to be equivalent,
they must be in the same format and each must test a representative
sampling of items measuring the same mental processes. Also, the
original testing and the retesting should take place within a week or
two in order to minimize the influence of intervening factors of developmental
and other individual changes.

Administering the identical test twice has some of the disadvantages 
of using two equivalent forms. It is held by some investigators that 
recall of answers to specific items of a test is an added disadvantage,
when the identical test form is given a second time. Although there
can be some recall, it is unlikely that this possibility will be an important
consideration, for the number of items in any test is too large for
the retention of many. When this method of estimating reliability is 
used, the interval between testings should be a week or two in order 
to minimize the effects of whatever recall might be operative. 


Split-half reliability. A test cannot have high consistency in retesting 
unless each application is relatively free of chance or random errors.
In split-half testing of reliability, chance and random errors may be
assumed to operate equally in both halves.

Calculating reliability by the split-half method consists of subdividing
the whole test into two parts, presumably equivalent, and then
treating the score of each part as though it were a separate form. This
method provides, essentially, a measure of the test’s internal consistency,
assuming an equal level of performance throughout the test by
each person. Split-half reliability is a first check upon the usefulness
of a test. It is easily found and saves unnecessary labor that might be
spent in following up an internally unsound device. This method tells 
us if the test is a reliable representation of an individual's traits at a 
given time. It does not describe completely the reliability of a test 
which is to be used periodically or for predictive purposes. For peri- 
odic and predictive testing, the test-retest method is desirable. 

The split-half method of determining reliability may, in some cir- 
cumstances, yield a coefficient of correlation that is somewhat too 
high. In calculating reliability, an assumption is that the operations of 
chance factors are uncorrelated and hence will cancel out one another. 
But in using the split-half method, both obtained measures are deter- 
mined at the same sitting and any chance fluctuations due to tempo- 
rary conditions within testees and to conditions in the external situ- 
ation will operate in the same direction and thus yield a somewhat 
higher correlation coefficient than might be found by other methods. 

Generally, for split-half reliability, the subdivision is made by taking 
the odd-numbered items as one part of the test and the even-numbered
items as the other. The score is then found for each person, for each 
of the subdivisions. (This method is referred to as odd-even reliabil- 
ity.) Since the correlation coefficient for the two sets of scores derived 
by this method is based upon subdivisions of the full test, each of 
which is half the length of the whole, a statistical formula (Spearman- 
Brown) is used to correct for the reduced lengths of the subdivisions 
from which the correlated scores have been determined. The reason 
for this correction is that the score of the whole test, being based upon 
a larger number of items, is a more adequate sampling of traits or 
functions and hence reduces the possible effects of chance solutions 
and accidental errors. The whole test is thus more reliable than its 
subdivisions; and the correction formula is intended to indicate what
the reliability of the entire test would be, based upon what was found
with the part scores.
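The odd-even procedure described above can be sketched as follows. The item responses are invented, each person's record being a list of 1's (right) and 0's (wrong) in item order; the function names are assumptions of this example. The correction applied at the end is the Spearman-Brown formula for a test twice the length of each half.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two paired lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp / sqrt(ss_x * ss_y)


def spearman_brown(r, n=2):
    """Estimated reliability of a test n times as long as the part scored."""
    return n * r / (1 + (n - 1) * r)


def odd_even_reliability(responses):
    """Return odd scores, even scores, half-test r, and corrected r."""
    # Items are numbered from 1 in the text, so list index 0 is item 1 (odd).
    odd_scores = [sum(person[0::2]) for person in responses]
    even_scores = [sum(person[1::2]) for person in responses]
    r_half = pearson_r(odd_scores, even_scores)
    return odd_scores, even_scores, r_half, spearman_brown(r_half, 2)


# Hypothetical records of six persons on an eight-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

odd_scores, even_scores, r_half, r_whole = odd_even_reliability(responses)
print(f"odd-even r = {r_half:.2f}; corrected for full length = {r_whole:.2f}")
```

With so few persons and items the figures mean little in themselves; the point is only the order of operations: score the halves, correlate them, then correct for length.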


An example will demonstrate how the Spearman-Brown formula
operates. The generalized formula is:

$$r_n = \frac{nr}{1 + (n - 1)r}$$

in which r is the coefficient of reliability obtained between the parts of
the divided test; $r_n$ is the reliability of the test n times as long as half
the original test.


In the method of odd-even reliability, n is 2, since the original test
has been divided in two equal parts. Assuming, then, that the odd-even
coefficient (r) is .80, and substituting the values in the formula, the
reliability of the whole test ($r_n$) is found to be .89. This estimated reliability
coefficient for the test as a whole is the one usually reported
in psychological research and in test manuals. 
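
Written out step by step, with n = 2 and r = .80 as in the example, the substitution in the Spearman-Brown formula runs:

$$r_n = \frac{nr}{1 + (n - 1)r} = \frac{2(.80)}{1 + (2 - 1)(.80)} = \frac{1.60}{1.80} \approx .89$$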


[Figure 1.1: curve of the reliability coefficient (vertical axis) against the multiples by which test length is increased or decreased (horizontal axis).]

FIG. 1.1. Changes in Reliability with Changes in Length of Test, as Predicted by Spearman-Brown Formula. Unit Length Reliability is .40. From L. L. Thurstone, The Reliability and Validity of Tests. Ann Arbor, Mich.: Edwards Bros., 1935.


The Spearman-Brown formula may be used to estimate the effect
upon reliability of a test of a given length if it should be increased by
any multiple (say, 3 or 4 times) or decreased by any fraction (say, 1/2
or 1/4). There is a point of diminishing returns, so to speak, beyond
which the very small increase in the reliability coefficient, resulting
from increase in length, does not warrant the extension of a test. (See
Figure 1.1.) Figure 1.2 illustrates increase in test reliability as the
length of a test is doubled. This figure demonstrates what happens
when reliability is calculated by the split-half method and then corrected
by the Spearman-Brown formula.
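
The diminishing returns from lengthening a test can be seen in a short sketch. It takes a unit-length reliability of .40 (the value named in the caption of Figure 1.1) and applies the Spearman-Brown formula at several multiples of length. The second function, `length_multiple_needed`, is an assumption of this example: it is simply the same formula solved algebraically for n.

```python
def spearman_brown(r, n):
    """Estimated reliability of a test n times as long as one with reliability r."""
    return n * r / (1 + (n - 1) * r)


def length_multiple_needed(r, target):
    """Multiple of present length estimated to reach the target reliability.

    Obtained by solving the Spearman-Brown formula for n.
    """
    return target * (1 - r) / (r * (1 - target))


unit_reliability = 0.40
multiples = [0.5, 1, 2, 4, 6, 8, 10]
estimates = [spearman_brown(unit_reliability, n) for n in multiples]

for n, estimate in zip(multiples, estimates):
    print(f"length x {n:>4}: estimated reliability {estimate:.2f}")

print(f"multiple needed for .80: {length_multiple_needed(unit_reliability, 0.80):.1f}")
```

Each further doubling or addition of length buys a smaller gain than the one before it, which is the practical meaning of the flattening curve.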




Selecting odd-numbered items as one half of the test and even- 
numbered items as the other half is justified on these grounds: items 
in most tests (as will be seen) are grouped together according to type 
(number sequences, vocabulary, etc.) and are graduated according to 
difficulty, from easiest to hardest. Thus, when this systematic arrangement
is employed, the odd-even procedure yields very close approximations
to equivalent half-scores, because each half-score is based
upon the same types of items and the same number of each type; and
each half-score is based upon items which progress in difficulty in
approximately the same degree. For example, consider the first ten
items of a single type (known as a subtest), say, verbal analogies.
Numbers 1, 3, 5, 7, 9 are, as a group, of approximately the same total
difficulty as numbers 2, 4, 6, 8, 10, if they are graduated in difficulty
from 1 to 10; for both the odd-numbered and the even-numbered include
items from practically the entire range of difficulty represented
from numbers 1 to 10.

[Figure 1.2: reliability of the whole test (vertical axis) plotted against reliability of half the test (horizontal axis), both scales running from 0 to 1.0.]

FIG. 1.2. Showing the Increase in Reliability of the Whole-Test Scores as a Function of the Reliability of Half-Test Scores, when the Spearman-Brown Formula Is Applied.




There are other methods of selecting items for getting part scores; 
but since they are not too frequently encountered, they will not be 
described here. Whatever method is used to get equivalent half-scores, 
the procedure must be based upon the psychological rationale and 
upon the format of the particular test under consideration, in order 
to insure, as far as possible, equivalence of items in respect to mental 
processes involved and in respect to difficulty, as well as number of
items in each part. 

The split-half method tends to overestimate the predictive relia- 
bility of an instrument, since the correlation is not affected by the 
ordinary conditions that cause normal fluctuations in a person’s performance
on different days. In particular, this method should not be
used in estimating the reliability of a pure “speed test,” by which
we mean a test whose items are of the same degree of difficulty
throughout and which therefore measures only rate of performance
at the given level of difficulty. Since all items in the test are of equal
difficulty, an examinee should do as well with any one item as with
any other. Hence, to measure rates of performance and to differentiate
among individuals, the time limit and the length of the test should be
such that no one is able to complete all the items. Under the circumstances,
except for chance errors in performance, the odd-even correlation
should be +1.00 (perfect positive), because the test is, presumably,
uniform throughout and the psychological function being
measured (speed) is operating uniformly on all items. It is apparent,
thus, that the total scores on the odd-numbered items should equal
those on the even-numbered. One test manual, for example, reports an
odd-even reliability coefficient of .99+, but the manual also reports a
coefficient of .88 when the scores of two equivalent forms were used.⁵
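
Why the odd-even coefficient of a pure speed test runs so close to +1.00 can be shown with a small sketch. It assumes, for illustration only, that every item attempted is answered correctly, so that a person's score is simply the number of items reached; the numbers of items completed are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two paired lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp / sqrt(ss_x * ss_y)


# Hypothetical numbers of items completed within the time limit by eight
# examinees; in this sketch every completed item counts as correct.
items_completed = [23, 31, 18, 40, 27, 35, 12, 29]

# Items 1, 3, 5, ... form the odd half; items 2, 4, 6, ... the even half.
odd_scores = [(k + 1) // 2 for k in items_completed]
even_scores = [k // 2 for k in items_completed]

r_odd_even = pearson_r(odd_scores, even_scores)
print(f"odd-even correlation on the speed test: {r_odd_even:.3f}")
```

The two half-scores of any one person can never differ by more than a single item, so the correlation is forced upward by the scoring arrangement itself and says nothing about how stable the person's speed would be on another day.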

The best practice is to use the test-retest method with a highly 
speeded test. Tests differ in respect to the significance of speed of
performance, even when the items are also scaled in difficulty. As the
role of speed in a test decreases, the odd-even correlations will differ
less and less from those obtained with the test-retest method.


The standard error of measurement. This index is an estimate of the
deviation of a set of obtained scores from their “true” scores.⁶ It is
dependent upon the standard deviation of the distribution of obtained
scores and upon the coefficient of reliability of the test from which
the distribution of scores was obtained. The formula for determining
the standard error of measurement is written:

$$SE_{meas.} = SD_t \sqrt{1 - r_{tt}}$$

in which $SD_t$ is the standard deviation of the distribution of the obtained
scores, and $r_{tt}$ is the reliability coefficient of the test.

5 Differential Aptitude Tests, Manual, page C-6, New York: The Psychological
Corporation.

6 A “true” score is the measure that is quite free from and uncontaminated
by chance factors and errors of measurement; theoretically, it represents an
individual's true level of performance on the test being used.

Assume that the standard deviation of a test ($SD_t$) is 12 IQ points
and that its coefficient of reliability ($r_{tt}$) is .90. Substituting these
values in the formula, we find that the standard error of measurement
is approximately 3.8 points. This statistic is interpreted as follows:
assuming that the test scores are normally distributed and that the
“errors of measurement” are similarly distributed, then approximately
68 percent of the obtained scores are within 3.8 points of the true
scores for the persons measured. Otherwise stated, the odds are 68
out of 100 (or 68 to 32) that a particular individual's obtained score
is in error by 3.8 points or less. Then using the table of probabilities
for standard deviation values, we can say, further, that the probabilities
are 19 to 1 (95 in 100) that the error of measurement will be
7.6 points (twice the standard error of measurement) or less; and
99 to 1 that it will be 9.5 points (2½ times the standard error of
measurement) or less.
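
The arithmetic of this illustration can be checked with a few lines. The function name is an assumption of this sketch; the inputs (a standard deviation of 12 and a reliability of .90) and the multiples 1, 2, and 2½ are those of the text.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard deviation of obtained scores times sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)


sem = standard_error_of_measurement(sd=12, reliability=0.90)

# Limits of error at 1, 2, and 2.5 standard errors, rounded to one decimal.
limits = {multiple: round(multiple * sem, 1) for multiple in (1, 2, 2.5)}
print(limits)  # {1: 3.8, 2: 7.6, 2.5: 9.5}
```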

The foregoing technique gives us the means of estimating an in- 
dividual’s “true” score from a set of obtained scores, of which his is 
one. Using the data of the illustration above, assume an individual’s 
obtained IQ is found to be 100. The probabilities are, then, two to 
one that his “true” IQ lies between 96.2 and 103.8. (For practical 
purposes we would say between 96 and 104.) And the probabilities
are nineteen to one that it lies between 92.4 and 107.6.

Obviously, the higher the test’s reliability coefficient, the smaller
will be the error of measurement, and therefore the greater the predictive
value of the test. The standard error of measurement provides
us, also, with a basis for judging whether or not the scores for two
persons represent a true difference or whether they are only deviations
from the same, or nearly the same, true scores. For example,
if one person gets an IQ of 100 and another person gets one of 96,
are these within the range of the same true score or are they significantly
different, statistically? Using the data of the illustration
given above, we say they are within the range, and they are not statistically
significant. Also, quite aside from any question of statistical
significance, a clinical psychologist knows from experience with de- 
tails of test performance that no psychological significance attaches 
to a difference between IQ’s of 100 and 96, or to similar differences 
elsewhere on the scale. 
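
The bands quoted for an obtained IQ of 100, and the comparison of IQ's of 100 and 96, can be reproduced in a short sketch. The text does not state the exact criterion behind "within the range"; reading it as "inside the band of two standard errors" is an assumption of this example, as are the function names.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard deviation of obtained scores times sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(obtained, sem, multiple=1.0):
    """Lower and upper limits lying `multiple` standard errors from a score."""
    return (obtained - multiple * sem, obtained + multiple * sem)


sem = standard_error_of_measurement(12, 0.90)

band_two_to_one = score_band(100, sem, 1)       # odds of about two to one
band_nineteen_to_one = score_band(100, sem, 2)  # odds of about nineteen to one

# How far apart are IQ's of 100 and 96, in standard errors of measurement?
difference_in_sem_units = abs(100 - 96) / sem
low, high = band_nineteen_to_one
within_band = low <= 96 <= high

print([round(v, 1) for v in band_two_to_one])
print([round(v, 1) for v in band_nineteen_to_one])
print(f"difference = {difference_in_sem_units:.2f} standard errors; "
      f"inside the wider band: {within_band}")
```

A difference of about one standard error, as here, is well inside the limits that chance errors alone would be expected to produce.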

It is clear that while it is essential to have the reliability coefficient
for a test, as an estimate of relative reliability, it is equally essential
to have the error of measurement as an estimate of absolute reliability.⁷

Analysis of variance. As already stated, the degree of reliability of
a test depends upon the extent to which variations in scores of the 
testees are attributable to “true” differences among the individuals 
constituting the group, and the extent of inaccuracies of measure- 
ment. A test is unreliable in proportion to the variation of results 
attributable to factors of test inaccuracy, rather than to “true” dif- 
ferences among the members of the group. The estimate, in the scores 
of a group, of the proportions of variation due to each of the several 
factors, is technically known as analysis of variance.⁸

In a study of intelligence-test reliability by this method, we would 
ask: what factors may be important, and to what extent, in producing 
the obtained differences of scores on two applications of the identical 
test (or of equivalent forms) to the same group of persons? First: 
since individuals differ in any population sample, the analysis should
estimate the extent to which obtained differences in scores are due to 
“true” differences in the functions being measured. Second: if there 
is some general improvement of scores on the second test, it would 
be necessary to estimate the “practice” effect. Third: since the two 
foregoing factors would, in all probability, not account for all differ- 
ences in scores, it is assumed that there are “residual differences” due 
to errors of measurement attributable directly to the test being used;
that is, weaknesses or defects within the test. If additional influencing
factors could be isolated, their significance in producing the obtained
scores would also be determined. Those factors that cannot be isolated
and separately analyzed remain as “residual” factors.

7 The SE (meas.) is an over-all index, theoretically applicable throughout
the range of scores. It sometimes happens, however, that a test measures with
less “error” at some parts of the scale than at others. In that case, it is possible
to determine at which parts of the scale the “errors of measurement” are larger
or smaller. On the Stanford-Binet Scale (1937), for example, the SE (meas.)
is 5.2 points for IQ’s above 130, but only 2.2 points for IQ’s below 70.

8 Variance is defined as the mean of the squared deviations from the mean
score of the group. A measure of deviation is an index of the extent to which
individual scores of a group vary from the group’s average score. Variance is
the statistical term for the square of the standard deviation.

Analysis of variance as a method of estimating reliability is preferred
by some psychologists, but it has not been widely used.⁹
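
The three-part division of score differences described above (differences among persons, a general practice effect, and residual differences) can be sketched for two applications of a test. The scores are invented, and the function is an assumed illustration of the arithmetic of partitioning sums of squares, not a reproduction of any published procedure.

```python
def partition_variation(first, second):
    """Split the total variation of scores on two testings into three parts."""
    n = len(first)   # number of persons
    k = 2            # number of testings
    grand_mean = (sum(first) + sum(second)) / (n * k)
    person_means = [(a + b) / k for a, b in zip(first, second)]
    occasion_means = [sum(first) / n, sum(second) / n]

    ss_total = sum((x - grand_mean) ** 2 for x in first + second)
    # Variation due to differences among persons.
    ss_persons = k * sum((m - grand_mean) ** 2 for m in person_means)
    # Variation due to a general shift between testings ("practice").
    ss_practice = n * sum((m - grand_mean) ** 2 for m in occasion_means)
    # What is left over after the two sources above are removed.
    ss_residual = ss_total - ss_persons - ss_practice
    return {
        "persons": ss_persons,
        "practice": ss_practice,
        "residual": ss_residual,
        "total": ss_total,
    }


# Hypothetical scores of five persons on two applications of the same test.
first_testing = [10, 12, 15, 18, 20]
second_testing = [12, 13, 18, 19, 23]

parts = partition_variation(first_testing, second_testing)
for source in ("persons", "practice", "residual"):
    share = parts[source] / parts["total"]
    print(f"{source:>8}: {parts[source]:6.1f}  ({share:.1%} of total)")
```

In these invented data nearly all of the variation lies among persons, a small part reflects the general gain on the second testing, and very little is left as residual, which is the pattern one would hope for in a reliable instrument.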

Reliability is also evaluated at times by means of statistical devices
with which may be calculated consistency of performance from item
to item within a test. This method introduces the assumption that
the test is completely homogeneous as to functions measured; that
is, that each item in the test measures precisely the same composite
of mental functions as every other item. In most tests this is a doubtful
assumption; but if the assumption is warranted, the technique may
be used.¹⁰
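
One widely used device of this kind is the formula known as Kuder-Richardson 20, which the text itself does not reproduce. The sketch below is therefore an addition, offered on the assumption that it is the sort of item-consistency index meant here: it compares the sum of the item variances (p times q for each item scored right or wrong) with the variance of total scores. The responses are invented.

```python
def kuder_richardson_20(responses):
    """Item-consistency estimate for items scored 1 (right) or 0 (wrong)."""
    n_persons = len(responses)
    n_items = len(responses[0])

    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_persons
    # Variance of total scores: the mean of the squared deviations.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons

    # Sum over items of p * q, where p is the proportion passing the item.
    sum_pq = 0.0
    for item in range(n_items):
        p = sum(person[item] for person in responses) / n_persons
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Hypothetical records of six persons on an eight-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

kr20 = kuder_richardson_20(responses)
print(f"item-consistency estimate: {kr20:.2f}")
```

The index is interpretable only under the assumption of homogeneity stated above; where the items sample several distinct functions, a low value need not mean that the test is a poor one.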


Factors affecting the interpretation of reliability coefficients. In addi- 
tion to the considerations mentioned in connection with the several 
methods of estimating reliability, there are other factors that must be 
taken into account in interpreting reliability findings.

Range of ability of the group tested affects the reliability coefficient. 
If a reliability coefficient is found with a group that has a relatively 
small variation of the trait or function being measured, the coefficient 
will be relatively low. If the group has a wider range in the trait or
function, the coefficient will be higher. (See Figure 1.3.) Thus, a test
having high reliability for a widely varying group does not necessarily
have equal reliability for a significantly more homogeneous
group of persons. The reasons for this fact are several, one being the
nature of the correlation process and the elements in the correlation
formula. 
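
The effect of range upon the coefficient can be demonstrated with constructed data. In the sketch below the same two forms, with the same size of error for every person, are correlated first for a widely varying group and then for the six persons in the middle of that group. All values are invented, and the alternating error pattern is chosen only to keep the example exact and repeatable.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two paired lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp / sqrt(ss_x * ss_y)


# Hypothetical "true" levels of twenty persons, evenly spread from 0 to 95.
true_levels = [5 * i for i in range(20)]
# An error of 4 points that alternates in sign from person to person and
# acts in opposite directions on the two forms (an illustrative device).
errors = [4 if i % 2 == 0 else -4 for i in range(20)]
form_x = [t + e for t, e in zip(true_levels, errors)]
form_y = [t - e for t, e in zip(true_levels, errors)]

r_whole_group = pearson_r(form_x, form_y)
# Restrict the group to the six persons in the middle of the range.
r_restricted = pearson_r(form_x[7:13], form_y[7:13])

print(f"whole group:      r = {r_whole_group:.2f}")
print(f"restricted group: r = {r_restricted:.2f}")
```

The errors of measurement are identical in the two calculations; only the spread of the group has changed, and the coefficient falls with it.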

For illustrative purposes, suppose that we are dealing with a com- 
pletely homogeneous group of individuals, with respect to one meas- 


ure: namely, chronological age. Assume that everyone in the group 
is exactly ten years of age. If they are an adequately representative 
9 Since analysis of variance as a meth i i jability. requi 

mane s od of estimating reliability requires 
more han, eT Pi elementary statistics, it will not pa further elaborated 
here. See igh AES and G. A. Ferguson, Studies on The Reliability of 
Tests, Toronto: Department of Educational Research, University of Toronto, 
1941. Also C. Hoyt, “Test Reliability Obtained by Analysis of Variance,” 
Psychometrika, Vol. 6,1941, pp. 153-160. : 


See G. F. Kuder and M. W. Richardson, “The Theo oer 
Test Reliability,” Psychometrika, Vol. 2, 1937, pp. 151-160, at Estimats gk 


RG 


Reliability 21 


sample of all ten-year-olds, the range in test score might be from 
extremely low to extremely high. In this instance, since there is no 
deviation (or range) whatever in one of the measures (chronological 
age), the correlation coefficient for the two variables (test score and 
CA) will be zero.™ Such an extreme instance rarely occurs, but it 
does demonstrate that when there are possibilities for wide variations 
in one measure (in this in- in 
stance, the test score) and very 
restricted possibilities in the 
other (in this instance, the 
CA) the coefficient is lowered. 
If the age range were two 
years instead of one, the co- 
efficient of correlation would 
still be low, but not zero, be- 
cause in general the members 
of the older group tend to have 
higher test scores than do those 
in the younger. But since there Se i 
is a wide range of capacity STANDARD DEVIATION OF SCORES 
within ach Broun an. over rig. 1.3. Curve Showing Increase in 
lapping of capacity between Test Reliability as Variability of Group 
the two age groups, the coef- Increases. 


ficient will be low. . 
A correlation coefficient reflects the group trends of the measures.
As persons increase in age, mental capacity increases until maximum
development is reached. The correlation coefficient will reflect this
fact. But since there are wide differences in capacity within any age
group, and since there is considerable overlapping of capacity even
among rather widely separated age groups, the coefficient will be
affected by these facts also. The result will be a coefficient lower
than +1.00 (perfect correlation).
Thus, in correlation estimates of reliability, if the age range is wide,
the group trends in scores (higher scores with higher ages) will have
increased weight, as compared with a narrower age range in which
the age trend has less weight. In interpreting a reliability coefficient
of a test, therefore, it is necessary to know the range of ages upon
which the test was standardized.

11 Inspection of the product-moment correlation formula will show this
to be the case:

$$ r = \frac{\sum xy}{N \, SD_x \, SD_y} $$

in which $\sum xy$ is the sum of products of the deviations of the
paired scores; $SD_x$ and $SD_y$ are the standard deviations of the two
sets of measures.
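
The effect of a restricted range can be checked numerically with the product-moment formula. The following is a minimal sketch, not the author's procedure: the ages and scores are hypothetical values invented for illustration, and `pearson_r` is an illustrative helper name. Note that when one measure has no variation at all, the formula's denominator is zero, so the sketch reports the coefficient as undefined (`None`) rather than as a computed zero.

```python
import math

def pearson_r(xs, ys):
    """Product-moment r = sum(xy) / (N * SD_x * SD_y), with x and y as
    deviations from their means. Returns None when either SD is zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sd_x = math.sqrt(sum(d * d for d in dx) / n)
    sd_y = math.sqrt(sum(d * d for d in dy) / n)
    if sd_x == 0 or sd_y == 0:
        return None  # no variation in one measure: the ratio is 0/0
    return sum(a * b for a, b in zip(dx, dy)) / (n * sd_x * sd_y)

# Hypothetical data (an assumption for illustration): two children at each age.
ages = [8, 8, 9, 9, 10, 10, 11, 11, 12, 12]
scores = [20, 30, 25, 35, 30, 40, 35, 45, 40, 50]

# Wide age range (8 to 12 years).
r_wide = pearson_r(ages, scores)

# Narrow age range (10 and 11 years only), same children, same scores.
narrow = [(a, s) for a, s in zip(ages, scores) if a in (10, 11)]
r_narrow = pearson_r([a for a, _ in narrow], [s for _, s in narrow])

# A single age: chronological age has no deviation whatever.
r_single_age = pearson_r([10, 10, 10], [30, 40, 50])

print(r_wide, r_narrow, r_single_age)
```

With these invented figures the coefficient for the wide age range is noticeably higher than for the narrow one, although no child's score has changed.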


TABLE 1


Raw Scores and Ranks of Students
on Two Forms of an Arithmetic Test


               Form X            Form Y
Student     Score   Rank      Score   Rank

A             90      1         88      2
B             87      2         89      1
C             83      3         76      5
D             78      4         77      4
E            [?]      5         80      3
F             70      6         65      7
G             68      7         64      8
H             65      8         67      6
I             60      9         53     10
J             54     10         57      9
K             51     11         49     11
L             47     12         45     14
M             46     13         48     12
N            [?]     14         47     13
O             39     15         44     15
P             38     16         42     16
Q            [?]     17        [?]     17
R             30     18         34     20
S             29     19         37     18
T             25     20         36     19

[?] marks a score that is not legible in this copy.


Just as, in the foregoing illustrations, correlation coefficients were 
shown to be lowered by homogeneity in one of two variables, so in 
estimating reliability the coefficient will be lowered by restricting the 
group’s range of variation in the trait being measured. An illustration 
will help to clarify this matter.12

“In Table 1 are shown the raw scores and rankings of twenty 
students on two forms of an arithmetic test. Looking at the two sets
of rankings, we see that changes in rank from one form to the other
are minor; the ranks shift a little, but not importantly.” A coefficient
computed from these data is very high: r = .968.

12 From Test Service Bulletin, The Psychological Corporation, No. 44, May
1952.

“Now, however, let us examine only the rankings of the five top 
students. Though for these five students the shifts in rank are the 
same as before, the importance of the shifts is greatly emphasized. 
Whereas in the larger group Student C’s change in rank from third 
to fifth represented only a 10 percent shift (two places out of twenty), 
his shift of two places in rank in the smaller top group is a 40 per- 
cent change (two places out of five). When the entire twenty repre- 
sent the group on which we estimate the reliability of the arithmetic 
test, going from third on form X to fifth on form Y still leaves the 
student as one of the best in this population. If, on the other hand, 
reliability is being estimated only on the group consisting of the top 
five students, going from third to fifth means dropping from the mid- 
dle to the bottom of this population—a radical change.” A coefficient, 
if computed for just these five cases, is .50 (rho).13

“Note that it is not the smaller number of cases which brings about
the lower coefficient. It is the narrower range of talent which is
responsible. A coefficient based on five cases as widespread as the
twenty (e.g., Pupils A, E, J, O, and T, who rank first, fifth, tenth,
fifteenth, and twentieth respectively on form X), would be at least
as large as the coefficient based on all twenty students.” [rho =
+1.00]
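
The rank-order figures quoted above can be reproduced from the ranks in Table 1. The sketch below is illustrative only: `rerank` and `spearman_rho` are hypothetical helper names, and the Form Y ranks for Students K and Q (11 and 17) are read by elimination from the other ranks, which is an assumption about the garbled table. It applies the usual rank-difference formula, rho = 1 - 6Σd² / [n(n² - 1)], after re-ranking each subgroup from 1 to n.

```python
def rerank(values):
    """Convert a list of distinct ranks to ranks 1..n within the subgroup."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(ranks_x, ranks_y):
    """Rank-difference correlation for untied ranks."""
    n = len(ranks_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Ranks of Students A through T from Table 1.
rank_x = list(range(1, 21))
rank_y = [2, 1, 5, 4, 3, 7, 8, 6, 10, 9,
          11, 14, 12, 13, 15, 16, 17, 20, 18, 19]

def rho_for(indices):
    """Rho for a subgroup, given positions in the student list (0 = A)."""
    xs = rerank([rank_x[i] for i in indices])
    ys = rerank([rank_y[i] for i in indices])
    return spearman_rho(xs, ys)

rho_all = rho_for(range(20))               # all twenty students
rho_top5 = rho_for(range(5))               # Students A to E only
rho_spread = rho_for([0, 4, 9, 14, 19])    # Students A, E, J, O, T

print(round(rho_all, 3), rho_top5, rho_spread)
```

The top-five and widespread-five results agree with the values of .50 and +1.00 given in the quotation. The rho for all twenty comes out slightly above the quoted r of .968, which is a product-moment coefficient computed from the raw scores rather than from ranks.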

Furthermore, when the variation among testees is narrow, the cor- 
relation between two sets of scores may also be lowered by chance 
factors and minor psychological factors. Since individuals in such a 
group are closely clustered—that is, their true differences are small— 
the changes in scores and relative positions produced by extraneous 
factors are more significant than they would be in a widely divergent 
group. 

This illustration makes clear the fact that reliability coefficients of
a given test may vary as the composition of the tested group changes,
even though the performances of the testees themselves are unchanged.
Thus reliability data may show that a test discriminates satisfactorily
over a wide range of the trait or capacity measured; but reliability
may still be inadequate where finer and more precise discriminations
are necessary among individuals who vary within a narrow range.

The practical significance of range and hence of ability is this: in
standardizing a test, its author must determine reliability with a group
that is similar in average level of ability and in variation of scores to
the group with whom the test is to be regularly used. The user of a
test should select an instrument that, among other things, provides
reliability data based upon a sampling of persons who resemble closely
the group of individuals he desires to test and study.

13 Rho represents the “rank-order” correlation coefficient. It
approximates closely the product-moment coefficient (r).

The time interval between testings may be significant in the inter- 
pretations of reliability findings. When reliability estimates are based 
upon odd-even correlations (internal consistency), or upon the scores
of two equivalent forms of a test administered at a single sitting or 
within the same day, the results are uniformly affected by the exam- 
inees’ physical condition and attitudes, and by the prevailing environ- 
mental conditions during the testing. Such uniformity means that the 
factors external to the test itself are likely to affect both sets of test 
scores equally, thus increasing the degree of similarity of each per- 
son’s two scores. This condition tends to give higher reliability co- 
efficients (that is, gives higher estimates) regarding the instrument’s 
predictive value than would be the case if the retest were given after 
a time interval. 
When there is a time interval, the retest results will be affected by
the normally expected fluctuations in individual performances and 
by changes in environmental conditions. Thus, while test results and 
reliability coefficients obtained at a single sitting or in a single day 
are most likely to estimate best the consistency of the instrument it- 
self, they do not indicate stability of performance over a period of 
time as well as do coefficients obtained by the test-retest method, using 
a time interval. Conversely, the test-retest method is the more likely 
to underestimate the internal consistency of a test, because factors 
extraneous to it may affect the scores dissimilarly. The extent to 
which the accuracy of a test is underestimated by the test-retest 
method will depend upon the degree to which effective influencing 
conditions are inconsistent. If the time interval has been quite long, 
especially in the case of young children—perhaps three months or 
more—an individual’s retest results may be influenced by peculiarities 
of his growth tempo, or by other more or less enduring conditions
such as emotional experiences, which may affect persons of any age.
The longer the interval between tests, the more likely the lowering
of the correlation due to intervening factors.

The effects of practice and learning during the interval will depend 
upon the content of the test being used and upon the examinee’s ex- 
periences during the interval. For example, if some months have 
elapsed between two administrations of an educational achievement 
test, different pupils may have had different amounts and qualities of 
instruction during the period. The retest scores would, in part, reflect 
this instructional difference; thus the correlation coefficient would not 
be solely a reliability coefficient. Or, in the case of a personality test, 
individuals in therapy or after extensive counseling may have modi- 
fied their attitudes, values, and behavior sufficiently to produce sig- 
nificant differences in test-retest results. 

Which method of estimating reliability is preferable depends upon
the problem at hand. Psychologists and educators are usually concerned
with knowing: (1) the internal consistency of a test, and (2) the
predictive value of a test when it is subject only to the minor or
accidental changes in conditions from day to day, rather than to
fundamental changes resulting from permanent or semi-permanent
changes effected by learning, developmental idiosyncrasies, or
disturbing emotional experiences. For the first purpose, the odd-even
method, or the test-retest method, the tests being given at one sitting
or within one day (using equivalent forms), is preferable. For the
second purpose, the test-retest method, the tests being given within a
week or two (using the same form or equivalent forms), is the
preferable one. Under testing conditions that are not too different,
the results of the second method will not be far removed from those
obtained with the first. A test manual should provide information
regarding internal consistency and test-retest results.

Sub-test reliability is not always equal to total test reliability. It
has already been explained that, other factors being equal, the
reliability of a test increases with increase in length, although not
in direct proportion. This principle applies to those scales that
consist of several different parts (called sub-tests), each of which
utilizes a different type of content. Nearly all group tests are of
this kind, as are some of the individual scales (e.g., the Wechsler).
For these instruments, the total test reliability is higher than that
for each of the sub-tests. It is erroneous, therefore, to assume that
the reliability coefficient for the whole may be applied to a part.
For example, the Wechsler Intelligence Scale for Children shows a full
scale (nine sub-tests) reliability of .92 for a group of 200 children
7½ years of age, the coefficient having been calculated by means of
the split-half technique. Yet the reliabilities of the individual
sub-tests, for the same group of children, ranged from a low of .59 to
a high of .84. It is obvious that the total score is a more dependable
index of the function or functions being measured than is any sub-test
score.
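
The statement that reliability rises with length, though not in direct proportion, is commonly expressed by the Spearman-Brown formula. The sketch below assumes that formula and assumes, purely for illustration, that a scale is made up of comparable parts of equal reliability; actual sub-tests differ in content and reliability, so the figures are indicative only. The function name `spearman_brown` is an illustrative choice.

```python
def spearman_brown(r_part, k):
    """Estimated reliability of a test k times as long as a part whose
    reliability is r_part (Spearman-Brown formula)."""
    return k * r_part / (1 + (k - 1) * r_part)

# Doubling a test whose reliability is .50 does not double the coefficient.
r_doubled = spearman_brown(0.50, 2)

# Hypothetical case: nine comparable parts, each as reliable as the least
# reliable sub-test cited above (.59).
r_nine_parts = spearman_brown(0.59, 9)

print(round(r_doubled, 3), round(r_nine_parts, 3))
```

Under these assumptions a composite of nine parts of modest reliability reaches a coefficient above .90, which is consistent with the general point that the total score is more dependable than any single sub-test score.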
Consistency of scorers is a factor in calculating test reliability. Some
tests (such as the Stanford-Binet and, in particular, projective
techniques) are not entirely objective in scoring, since the examiner at
times finds it necessary to judge the correctness or quality of
responses. For tests such as these it is necessary to know the extent of
agreement in scoring found among two or more competent persons
who have scored the same sets of responses. Test authors usually
report such data in their manuals; and, in addition, other psychologists
will have carried out and reported studies on this problem. Lack of
agreement among scorers will adversely affect the reliability findings.


VALIDITY


An index of validity shows the degree to which a test measures 
what it purports to measure, when compared with accepted criteria. 
The construction and use of a test imply that the instrument has been 
evaluated against accepted standards or other criteria which are
regarded by experts as the best evidence of the traits or abilities to
be measured by the test. Selection of satisfactory validation criteria
and demonstration of an appropriate degree of validity are fundamental
in psychological and educational testing.

The first necessary condition of a valid test is that it have an
adequate degree of reliability. If the reliability coefficient of a
test is zero, it cannot correlate with anything. A test that correlates
poorly even with itself cannot correlate well with a measure of another
variable.
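
The dependence of validity upon reliability is often stated, in classical test theory, as a ceiling on the validity coefficient. The expression below is offered as a supplementary note under that assumption; the symbols are not taken from the text: $r_{xy}$ is the correlation of the test with a criterion, $r_{xx}$ the reliability of the test, and $r_{yy}$ the reliability of the criterion.

```latex
r_{xy} \le \sqrt{r_{xx}\, r_{yy}}
```

On this reading, a test whose reliability is zero can have no validity, and a test with a reliability of .49 could not be expected to correlate higher than .70 with even a perfectly reliable criterion.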


Operational and Functional Validity. It is useful to recognize two 
kinds of validity, although they are not mutually exclusive. The first 
is known as operational validity; the second is functional validity. 

14 Manual, New York: The Psychological Corporation, p. 13.

By operational validity we simply mean that the tasks required by
the test are adequate for the measurement and evaluation of certain
specified and defined psychological operations. For example, the
Seashore Measures of Musical Talent are actually tests of only certain
essential auditory aspects of musical talent, but not of “musical talent” 
which psychologically involves much more than these auditory aspects. 
Insofar as the Seashore tests differentiate correctly between persons 
in regard to the specified auditory processes, they are operationally 
valid. On the other hand, these measures are functionally valid to the 
extent that they are efficient in predicting subsequent development of 
various degrees of skill and competence in the several aspects of 
music. Thus, the functional validity of a test is the extent to which it 
is efficient in predicting and differentiating behavior or performance 
in a specified area under actual working and living conditions. 

Numerous other examples can be cited to illustrate the difference 
between functional and operational validity. Thus, a peg-board test 
(placing small metal pegs into a perforated board) may well measure 
manual and digital dexterity (operational), but it might be only 
slightly useful in predicting “mechanical ability” (functional). Again,
a word and number checking test may be quite satisfactory as a meas- 
ure of perception of details (operational), but it might have limited 
value in predicting “clerical ability.” 

It is obvious that functional validity is dependent, at least in part, 
upon the operational validity of the test. The reason is that the psycho- 
logical operations required by the test have been included because 
they have been found to be essential in certain actual situations in 
which testees will or might be placed. Hence, if the psychological 
operations themselves are not measured with adequate validity, pre- 
dictions of later performance will be adversely affected. 

Criteria of Functional Validity. The problems of selecting and
utilizing satisfactory validating criteria [words illegible in this
copy] kinds of tests. In validating tests of general ability
(intelligence), a common practice is to use one or more of the
following: scholastic marks, teachers’ judgments of individuals’
abilities, cumulative scholastic averages over a period of years,
number of school grades completed, chronological age, and known groups.
The reasons for using these as criteria are: (1) that scholastic
records are evidence of mental ability even though influenced by
factors other than intellectual ability; (2) that teachers are in a
position to evaluate individual ability with some validity, because
they observe their pupils over a long period and are able to make
inter-pupil comparisons; (3) that cumulative scholastic averages are
more valid than marks or estimates of a single teacher because they
represent combined judgments of performance over a longer period of
time; (4) that on the whole the more able persons complete more formal
education and reach higher levels in school and college; (5) that as
individuals grow older their levels of intelligence increase until
adult maximum is reached; (6) that definitely known groups, such as
gifted, somewhat superior, mentally deficient, slow learning, and
average groups will show differential performance on a valid test.

The principal criteria in standardizing tests of specific aptitudes 
(e.g., mechanical, musical) are marks in training courses and differ- 
entiation of known groups possessing the aptitude in varying degrees. 
An example of known groups would be those working efficiently at 
each of several levels of a mechanical occupation and those in non- 
mechanical occupations. It is highly desirable, of course, to use degree 
of success of actual performance in the vocation as an ultimate cri- 
terion. 

When the criterion of actual performance on the job is used, the 
following kinds of ratings are obtained: ratings by supervisors, evalu- 
ation of the quality of the product, and rate of work. However, the 
most frequently employed criteria for tests of specific aptitudes are 
marks and ratings in training courses; that is, criteria of capacity to 
learn the given skill or the profession, since aptitude tests are used 
largely to select individuals for training or education in the specified 
areas, although their use in employee selection is not inconsiderable. 

In personnel work, in business and industry, where specialized tests 
are used to select individuals for specific jobs, it is possible, indeed
essential, to use actual production records or performance ratings as
criteria of test validity. If, for example, a personnel department wants 
to know whether certain measures will identify the potentially best 
stenographers, the tests might be administered: (1) to a group of 
employees of several quality levels to estimate the instruments’
differentiating efficiency; and (2) to newly employed personnel whose
performance records, after an adequate period, would be correlated 
and otherwise analyzed against their test scores. 

Tests of educational achievement are validated against school 
marks and teachers’ ratings. Frequently, also, the criterion is “content 
validity” rather than some external standard. “Content validity”
means simply that the author of the test has determined upon its
content by means of an analytical process and his own judgment, as well
as that of experts in the subject-matter, as to what is appropriate and 
germane. For example, in constructing a test of American history, 
the author examines what he believes to be representative textbooks, 
consults teachers of history, decides which topics are most significant 
and what their relative weights should be, and devises items he be- 
lieves are most representative of these. 

Tests of personality traits present an especially difficult problem in 
validation. Often the author of the test uses “face validity.”
Whenever an author bases his test upon his own analysis of what is to
be evaluated or measured, without reference to prescribed content, as
in textbooks, or without subjecting his device to comparison with other
external standards, he is using “face validity” as his criterion. At 
times, the traits presumably being measured by a particular test of 
personality have been included only by fiat. The sounder tests in this 
category, however, are validated against actual behavior of the sub- 
jects and against clinical diagnoses. But even these criteria present 
difficulties because they are themselves affected to an appreciable de- 
gree by the subjective judgments of the persons making the evalua- 
tions of behavior or the clinical diagnoses. These matters will be 
discussed further in subsequent chapters. 

Criteria of validity may be immediate, intermediate, or ultimate. 
The use of marks in a particular course of study as a criterion in 
validating a test of specific aptitude is a case of an immediate cri- 
terion. The cumulative average marks in an entire training curriculum 
can be regarded as the intermediate criterion, if performance on the 
job itself is the ultimate criterion. If a test is being designed solely for 
the purpose of predicting marks in a single course (say, geometry), 
then those marks are the ultimate criterion. Whether a criterion be- 
longs in one or another of the categories depends upon the purpose 
of the test and the number of phases or steps that are available for 
use as criteria. In fact, when validity findings on a test are to be
interpreted, the purposes for which the instrument is designed must be
taken into consideration. A particular test may have different
validities for different purposes, different age groups, different sex
groups, etc. The validity of a test lies in the correctness with which
it measures at the time it is administered and in its predictive value
for specified activities by specified groups.




Factorial Validity. This method utilizes factor analysis techniques 
that are not within the scope of the present discussion. Factor analy- 
sis theory, however, is discussed in Chapter 3, in connection with 
theories of intelligence. Yet, since factorial validation is a method 
used with some tests, students should be familiar with the general 
nature of the theory.15

Most tests of mental ability and of personality sample a composite 
of performances such as verbal knowledge and facility, number fa- 
cility and quantitative reasoning, memory span, concept formation, 
etc. Factor analysts maintain that these and others, especially when
represented by a single composite index (such as mental age or in- 
telligence quotient) are not “functional unities.” Analysts urge that 
they are not measures of a “pure” ability; that is, just one type of 
ability uncomplicated by others. Thus, according to this theory, a 
test is said to have high factorial validity if it is a measure of one 
“functional unity” (e.g., word knowledge) to the exclusion of other 
elements as far as possible. The factorial process aims to identify, 
by the method of intercorrelations and further statistical analysis, a 
list of “functional unities” (also called “primary mental abilities”) 
within a test and the weight contributed by each of these to total 
performance on the test. The ultimate goal is to devise tests each of 
which will measure only one “functional unity” and be relatively in- 
dependent of others (that is, show quite low intercorrelations). Such 
“pure” tests would then be used singly; or they might be used as sub- 
tests in a comprehensive measuring instrument; but even then each 
subtest is scored and rated independently for the purpose of obtain- 
ing a psychological profile for each person. 

If validation stops at factorial validity, “operational” validity has 
been established. But if, after factorial validity is established, we 
proceed to validate the test against criteria of later performance in 
working situations, we are making a “functional” validation. The prin- 
cipal contribution of factorial validation is this: instead of validating 
the total, undifferentiated instrument against functional criteria, an 
effort is made to identify the component psychological elements and 
to establish their relative independence, and finally, to correlate these 
elements separately against functional criteria. 


Such analysis into psychological “unities,” or elements, is of value
when individuals are to be selected for specialized work or study and
their performance predicted therein. For example, since “mechanical
aptitude” is not a simple, unitary skill, it is valuable to be able to
identify which psychological elements have most predictive value for
a specified type of work. Mechanical work may involve a high degree
of spatial perception in one situation but not in another; or manual
precision and speed; or comprehension of mechanical principles. Also,
higher than average intelligence is desirable for the practice, let us
say, of both law and engineering. In the former, word knowledge and
verbal concept formation are the more significant, whereas in the
latter spatial perception and quantitative reasoning are more
significant than word knowledge. Factorial analysis can assist in
identifying the more limited and immediately relevant aspects of
ability required in a given occupation or activity.

15 See J. P. Guilford, “Factor Analysis in a Test-Development Program,”
Psychological Review, Vol. 55, 1948, pp. 79-94.


Face Validity. This is a term that is used to characterize test
materials which appear to measure that which the test author desires to
measure. Use of the term “validity” in such instances is hardly
warranted, for the materials have not been objectively analyzed for
validity. In instances of face validity the author and those using the
test simply assert in effect that the content of the test appears to be
appropriate and to serve their purposes. Face validity is found most
often in personality inventories and in some of the more recently
published projective tests, notably the Szondi test (described in
Chapter 19). This type of “validity” is also found at times in methods
used in the selection of industrial personnel. It is, however,
unwarranted and should not be resorted to unless a relatively objective
approach is impossible or not feasible.


Cross Validation. This term refers to the process of validating a test 
by using a population sample other than the one on which the instru- 
ment was standardized. The reason for using this method is that at 
times the original validity data may be spuriously high due to the 
operations of some chance factors that produce a higher correlation
than is warranted. As a matter of fact, however, once a test is put 
to use in a variety of situations and by many different persons, it is 
being constantly cross-validated; and if it does not prove to have high 
enough functional value, its use will be, or should be, discontinued. 


Methods of Calculating Validity. The most frequently used technique
of estimating validity is the simple correlation of test scores with
each criterion. A coefficient of a given magnitude cannot be arbitrarily
specified as signifying or as not signifying validity. Whenever a
validity coefficient is positive and significant, it has some value. In
some instances, coefficients of only +.30 have been found useful. Most
coefficients, however, should be larger.


FIG. 1.4. Chart for Pearson product-moment correlation between number
of errors made by 155 grocery store trainees on Part II of the
experimental Store Personnel Test, Form FS, and ratings made by the
training staff. r = .46. [Chart not reproduced. Horizontal axis: error
scores on Store Personnel Test, Form FS; vertical axis: ratings by the
training staff; each cell gives the number of persons.]


Figure 1.4 illustrates the simple correlation method.16 The test
scores, shown horizontally, were correlated with trainer ratings,
shown vertically. The number in each cell shows how many persons
earned the scores of that cell, as indicated on both axes. For example,
two persons who made between 21 and 23 errors on the test were
given trainer ratings of 12; then going to the bottom of the same
column, we find that one person who also made between 21 and 23
errors had a trainer rating of 2. For this sampling of examinees, the
coefficient is .46, which is well within the range of validity
coefficients most often found for a single criterion.

16 From Test Service Bulletin, No. 37, 1949, New York: The Psychological
Corporation. The data of biserial and tetrachoric correlations that
follow are also from this bulletin.

The biserial correlation coefficient is used when one of the criteria
is rated in terms of only two categories: e.g., “pass” or “fail”;
“satisfactory” or “unsatisfactory.” The second measure, however, is
given in terms of variable scores. This method is used when the
situation requires only a rough evaluation, as in the illustration
presented in Table 2. Here we see that the four groupings on the basis
of supervisors’ ratings (below average, average, above average,
excellent) have been reclassified into two categories (Groups 1 and 2)
which have been correlated with stenographic proficiency test scores.
The biserial coefficient of .60 indicates that the proficiency test has
considerable value in identifying stenographers who will function at
satisfactory or highly satisfactory levels.


TABLE 2


Biserial correlation between scores of 52 employed
stenographers on the Seashore-Bennett Stenographic
Proficiency Test and their supervisors’ ratings on
stenographic ability. r_bis = .60. (The Psychological
Corporation)

                  Ratings on Stenographic Ability
Test         Below                   Above
Scores       Average    Average     Average    Excellent
19 to 8      [cell frequencies not legible in this copy]
Subtotals       3          23          15          11
Totals               26                      26
                 (Group 1)               (Group 2)


The tetrachoric coefficient of correlation is an index that is found
when a coarse classification of two measures is adequate for the
purpose at hand. When this index is used, the ratings in each measure
are grouped into only two classes, providing a “four-fold” table. The
data in Table 2 have been so reclassified in Table 3, yielding a
tetrachoric coefficient of +.60.
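
The biserial coefficient described above can be sketched with the standard textbook formula, r_bis = [(M_upper - M_lower) / SD] × (pq / y), where p and q are the proportions in the two groups and y is the height of the normal curve at the point dividing them. The cell frequencies of Table 2 are not recoverable here, so the scores below are hypothetical values invented for illustration, and `biserial_r` is an illustrative helper name rather than a library function.

```python
import math
from statistics import NormalDist

def biserial_r(scores_upper, scores_lower):
    """Biserial r between a continuous score and a two-category rating
    that is assumed to reflect an underlying normal trait."""
    n_upper, n_lower = len(scores_upper), len(scores_lower)
    n = n_upper + n_lower
    all_scores = list(scores_upper) + list(scores_lower)
    mean_all = sum(all_scores) / n
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in all_scores) / n)
    p = n_upper / n          # proportion in the upper category
    q = 1 - p                # proportion in the lower category
    # Ordinate of the unit normal curve at the cut separating q from p.
    y = NormalDist().pdf(NormalDist().inv_cdf(q))
    mean_upper = sum(scores_upper) / n_upper
    mean_lower = sum(scores_lower) / n_lower
    return (mean_upper - mean_lower) / sd * (p * q / y)

# Hypothetical proficiency scores (an assumption, not the Table 2 data).
group_2 = [14, 15, 17, 18, 19]   # rated above average or excellent
group_1 = [11, 12, 13, 15, 16]   # rated average or below

r_bis = biserial_r(group_2, group_1)
print(round(r_bis, 3))
```

Because the formula assumes a normally distributed trait beneath the two-category rating, it can yield values above 1.00 when the data depart markedly from that assumption; the coefficient is best read as an estimate.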

Whether one uses the finer classifications necessary for calculating 
the product-moment coefficient (the “simple” correlation above) or 
uses the coarse groupings shown in biserial and tetrachoric calcula- 
tions will depend upon the nature of the data available and upon the 
purpose for which validation is to be used. 


TABLE 3

Four-fold table for computation of tetrachoric correlation
coefficient. r_tet = .60. (The Psychological Corporation)

                           Ratings
                     3-5            6-8
High on Test       6 (11.5%)     19 (36.5%)
Low on Test       17 (32.7%)     10 (19.3%)
                  Rated Low      Rated High
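
An exact tetrachoric coefficient requires tables or numerical integration of the normal correlation surface, but a well-known short-cut, the “cosine-pi” approximation, gives a rough check from the four cell frequencies alone. The sketch below applies that approximation, which is a swapped-in technique and not necessarily the method used for Table 3, to the frequencies shown there; `tetrachoric_cos_pi` is an illustrative name.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric coefficient.
    a and d are the agreeing cells (high-high, low-low);
    b and c are the disagreeing cells (high-low, low-high)."""
    if b == 0 or c == 0:
        return 1.0
    if a == 0 or d == 0:
        return -1.0
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))

# Frequencies from Table 3:
#   high on test and rated high = 19, high on test and rated low = 6,
#   low on test and rated high = 10, low on test and rated low = 17.
r_approx = tetrachoric_cos_pi(a=19, b=6, c=10, d=17)
print(round(r_approx, 2))
```

The approximation comes out in the high .50s, close to the tabled value of .60; small differences of this kind are to be expected from a short-cut formula.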


Multiple correlation is a method whereby two or more criteria are
statistically combined and correlated with the test score to yield a
single coefficient. Whereas the simple product-moment coefficient
indicates the degree of relationship (or co-variation) between two sets
of measures, the multiple correlation coefficient shows the relationship
between one set of measures (in this instance, test scores) and the
composite of two or more other sets of measures (in this instance, the
criteria). In other words, while a test might have a low or moderate
correlation with a single criterion, it can have a quite significant
correlation with several criteria taken together as a composite. This
is so because the several criteria in combination have more elements
or factors in common with the test than does any one factor taken
singly.
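
For the case of two criteria, the multiple correlation can be computed directly from the three simple correlations. The sketch below uses the standard two-variable formula, R² = (r₁² + r₂² - 2 r₁ r₂ r₁₂) / (1 - r₁₂²); the coefficient values are hypothetical and the function name `multiple_r` is an illustrative choice.

```python
import math

def multiple_r(r_t1, r_t2, r_12):
    """Multiple correlation of a test with the best-weighted composite of
    two criteria.
    r_t1, r_t2: correlation of the test with criterion 1 and criterion 2.
    r_12: correlation between the two criteria (must not be +1 or -1)."""
    r_squared = (r_t1 ** 2 + r_t2 ** 2 - 2 * r_t1 * r_t2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)

# Hypothetical values: the test correlates .40 with each criterion, and the
# two criteria correlate only .20 with each other.
R = multiple_r(0.40, 0.40, 0.20)
print(round(R, 3))
```

With these invented figures the multiple coefficient exceeds either simple coefficient, because each criterion contributes something the other does not. When the two criteria are themselves highly correlated, the gain is smaller.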

Expectancy tables provide a relatively simple, straightforward, and
very valuable method of estimating the predictive efficiency of a test.
The estimates are based upon the calculated probabilities that an
individual who has a given test score will achieve a specified score or
rating in the performance being predicted. We might ask, as examples,
the following questions: What are the probabilities that a prospective
college student scoring in the highest decile group on a “scholastic
aptitude” (intelligence) test will remain in college a given number of
terms? What are the probabilities that a child with an IQ of 80-85
will be able successfully to complete the work of the eighth grade?
What are the probabilities that a candidate getting an average score
on a stenographic proficiency test will achieve a rating of “excellent”
or “above average” on the job? Appropriate expectancy tables are
intended to answer these and similar questions.


TABLE 4 


Decile Rank on a Scholastic Aptitude Test
and Semesters Completed
(in percents)

Decile                      Terms
Rank *       2     3     4     5     6     7     8

X           [?]    95   [?]    90    89    88    88
[?]         [?]    87    85    82    81    78    78
[?]         [?]   [?]   [?]   [?]   [?]   [?]   [?]
[?]         [?]    71    66    61    59    57    56
I           [?]    67    60    52   [?]    49    48

* Decile rank X is the highest; I is the lowest
[?] Entry illegible in this copy

Table 4 is an illustration in point. It presents part of a larger table 
representing all ten decile groups.

To take two items from Table 4, we may say the probability is that
88 in 100 of the students in the highest decile group on the scholastic
aptitude test will complete their academic course, whereas only 48 in
100 of the lowest decile group will do so.

Table 5 illustrates the use of expectancy data in personnel selection.
Inspection of this table shows that it may be used to indicate what
percentage of individuals obtaining each of the several ratings on actually
demonstrated stenographic ability may be found at each of the several
levels on the proficiency test. It is also possible to calculate the
percentages by rows (instead of columns) so as to indicate the converse:
namely, the frequencies of the several ability ratings within each of
the score intervals of the proficiency test.


36 Basic Theoretical Principles 


Comparison of Tables 4 and 5 demonstrates that expectancy tables 
need not be uniform. The form and arrangement of data will depend 
upon the particular probabilities one desires to determine. But all ex- 
pectancy tables for tests have this in common: they provide estimates 
of the probabilities that a certain level or quality of performance may 
be expected if the test score is known—that is, its functional validity 
in terms of probabilities in place of or, more often, in addition to a 
correlation coefficient. 


TABLE 5 


Expectancy table showing the number and percent of stenographers
of various rated abilities who came from specified score groups on
the S-B Stenographic Proficiency Test. (N = 52, mean score = 15.4,
S.D. = 2.9, r = .61; score is average per letter for five letters.) (The
Psychological Corporation)

Stenographic    Number in each score group          Percent in each score group
Proficiency     receiving each rating               receiving each rating
Test Score      Below            Above    Excel-    Below            Above    Excel-
                Average Average  Average  lent      Average Average  Average  lent

18-19             -        4       6        7         -       17       40       64
16-17             -        2       2        4         -        9       13       36
14-15             -       10       5        -         -       44       33        -
12-13             -        4       1        -         -       17        7        -
10-11             2        2       1        -        67        9        7        -
8-9               1        1       -        -        33        4        -        -

Total             3       23      15       11       100      100      100      100
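
The two ways of reading an expectancy table (percentages by columns and percentages by rows) can be sketched in a few lines of Python. The counts below are those read from Table 5; the function names are illustrative only, and the rounding is ordinary rounding to whole percents, so a computed value may differ by one point from a printed value that was adjusted to make a column total 100.

```python
# Counts as read from Table 5. Rows are score groups (high to low);
# columns are ratings: below average, average, above average, excellent.
RATINGS = ["below average", "average", "above average", "excellent"]
COUNTS = {
    "18-19": [0, 4, 6, 7],
    "16-17": [0, 2, 2, 4],
    "14-15": [0, 10, 5, 0],
    "12-13": [0, 4, 1, 0],
    "10-11": [2, 2, 1, 0],
    "8-9":   [1, 1, 0, 0],
}


def column_percents(counts):
    """Percent of each rating group that came from each score group."""
    totals = [sum(row[j] for row in counts.values()) for j in range(len(RATINGS))]
    return {
        group: [round(100 * n / t) for n, t in zip(row, totals)]
        for group, row in counts.items()
    }


def row_percents(counts):
    """The converse: percent of each score group receiving each rating."""
    return {
        group: [round(100 * n / sum(row)) for n in row]
        for group, row in counts.items()
    }


by_column = column_percents(COUNTS)
by_row = row_percents(COUNTS)
print(by_column["18-19"])  # share of each rating group scoring 18-19
print(by_row["18-19"])     # ratings earned by those scoring 18-19
```

Read by rows, the table answers the selection question directly: of the candidates scoring 18-19, what proportion later earned each rating.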


A cut-off score is a special instance of the expectancy method. It is 
a test score that is used as a point of demarcation between examinees 
who will be accepted and those who will be rejected. For example, 
Table 6 shows several values from the Cornell Index (a personality 
inventory discussed in Chapter 17) that might be taken as cut-off 
scores. 


A low score on this inventory signifies fewer personality problems;
hence it is more desirable. The table reads: If a cut-off score of 7 on
the Cornell Index were used with this group of 1000 persons, 86
percent of those who were rejected after the interview would have been
rejected also by the Index; but 28 percent of those accepted after the
interview also would have been rejected by the Index. The other
percentages are read in the same way. Since the higher scores in this
instance are the less desirable, taking 7 as the cut-off level means
that a score of 7 or lower would be necessary for acceptance. If
23 were the cut-off, then anyone with that score or lower would be
acceptable. In this instance the cut-off score becomes less selective as
it increases.
TABLE 6 
Percent of Psychiatric Rejects * and Accepts * 
Identified by the Cornell Index (Reprinted 
by permission from the Manual. The 
Psychological Corporation.) 


Cut-off Level 400 Rejects 600 Accepts 


7 86% 28% 
13 7+ 13 
23 50 4 


* Based upon psychiatric interviews 
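
The percentages in Table 6 can be turned back into numbers of persons, which makes the cost of each cut-off level easier to see. The following is a minimal sketch using the group sizes (400 rejects, 600 accepts) and the percentages given for cut-off levels 7 and 23; the function name and the "share of flagged persons who were true rejects" figure are illustrative additions, not part of the Manual.

```python
REJECTS, ACCEPTS = 400, 600  # group sizes stated in Table 6


def flagged_counts(pct_rejects_flagged, pct_accepts_flagged):
    """Convert Table 6 percentages into counts of persons flagged by the Index."""
    true_rejects = round(REJECTS * pct_rejects_flagged / 100)
    false_rejects = round(ACCEPTS * pct_accepts_flagged / 100)
    total_flagged = true_rejects + false_rejects
    share_correct = true_rejects / total_flagged
    return true_rejects, false_rejects, total_flagged, share_correct


strict = flagged_counts(86, 28)   # cut-off level 7
lenient = flagged_counts(50, 4)   # cut-off level 23
print(strict)
print(lenient)
```

At level 7 the Index catches most of the psychiatric rejects but also flags many persons the interviewers accepted; at level 23 it flags far fewer persons, and a larger share of those flagged are true rejects.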


If we are using a test on which larger scores signify higher and more
desirable levels of the ability or trait being evaluated, then the cut-off
score becomes the more selective as it is increased. Table 7 is a case
in point.
TABLE 7 
Percent of Superior and Inferior Workers Identified by 
a Proficiency Test 


                  Superior Workers        Inferior Workers
Cut-off Score    Accepted   Rejected     Accepted   Rejected

     20            100%        0%           80%        20%
     25             90         10           60         40
     30             80         20           40         60


The hypothetical example shown in Table 7 is interpreted thus: If
20 were set as the minimum acceptable score, then all examinees who
proved to be superior workers would have been employed; but so
would 80 percent of those who proved to be inferior workers. The
other cut-off scores are similarly interpreted.

It is clear that cut-off scores are especially useful in instances where
many more candidates are available than there are places to be filled,
so that the cut-off level may be made highly selective, and where one is
not concerned primarily with the individual candidates as such, but
rather with the places, jobs, or niches to be filled. The purpose in using
cut-off levels is to identify a maximum number of potentially superior 
or desirable persons and, at the same time, to eliminate a maximum 
number of inferior or undesirable individuals. Since no test has perfect 
validity, and since the true potential of some persons may not be re- 
vealed by a single test, screening by means of cut-offs will not be per- 
fect. Some desirable persons will be rejected, whereas some undesir- 
ables will be selected. Yet, cut-offs and other methods have a very 
considerable advantage over subjective procedures previously em- 
ployed; for they provide the data for estimating with greatly increased 
objectivity and accuracy what are the chances of identifying the per- 
sons with the desired abilities or traits. 

Other methods, it will become apparent in later chapters, are used 
in addition to those already explained. Among these others are, for 
example, the percent who are successful in adjacent age groups and 
in groups of known ability; significant increases in scores from age to 
age and from group to group; closeness of approximation of the
distribution of scores to the normal frequency curve. Also, in validating
personality scales, extent of agreement by specialists in the
interpretation of results is an accepted criterion.

There are instances, too, when very low correlations are regarded as
evidence of a test’s validity. For example, if one starts with the
hypothesis that “mechanical aptitude” is a special ability and, as such,
relatively independent of what is measured as “general intelligence,”
then in constructing a test of the former, one should, among other
things, aim to devise a test which has a quite low or negligible
correlation with the latter.


Item Analysis. With very few exceptions, psychological tests (other 
than projective techniques) are made up of a large number of items. 
The score on each item is added to the scores of the other items to 
obtain a subtest score or a total score, either or both of which are used 
in calculating reliability and validity. Ultimately, however, the quality
and merit of a test depend upon the individual items of which it is 
composed. It is therefore necessary, in best practice, to analyze each 
item in the standardization process in order to retain only those that 
suit the purposes and rationale of the device being constructed. Item
analysis is thus an integral part of both the reliability and the validity
of a test.




In evaluating items, three aspects are, in the main, considered: (1) 
the level of difficulty of each; (2) correlation of each item with the 
score of a subtest or with the total score of the whole test; (3) the 
degree to which each item differentiates between a high group and a 
low group (variously selected), or between several groups at different 
levels.

The first of these aspects, item difficulty, is a matter of the
percentage of individuals able to pass each item. In practice, if an item is to
distinguish between individuals, it should not be so easy that all
persons can pass it; nor should it be so difficult that none are able to
pass it.¹⁷ It can be demonstrated statistically that an item passed by
50 percent of a group discriminates between more pairs of persons
than does an item passed by a smaller or larger group. For example,
if an item is attempted by 100 individuals and passed by only 10, and
if the testees are taken by pairs, there are 900 (10 × 90)
combinations in which that item can discriminate between paired members of
that group. If the item is passed by 50 in the group, then the number
of possible discriminations between paired individuals is 2500
(50 × 50), this being the largest number possible, as the
multiplication of any other proportions will show.¹⁸ Obviously not all items in a
scale are or should be such as to be passed by 50 percent of the group.
Some are included that are passed by a large percentage and some by
a small percentage, with many degrees between the extremes.

17 Theoretically, it would be desirable that the test be so scaled that there is
at least one item which can be passed by all for whom the test is intended. For
zero scores on a particular test do not necessarily mean absolute zero capacity
in the function being measured; nor will all zero scores necessarily signify the
same status. Conversely, it would be desirable that a test be scaled upward to a
level where no one for whom the test is intended is able to pass the highest
item. This aspect would require, of course, that the test be constructed by a
person superior to any of the intended testees.

18 It is not to be assumed that “50 percent passing” is necessarily the best
criterion in placing an item in an age scale (like the Binet), as will be seen later.
Percentages passing an item may be converted into scale values on the base
line of the “normal” curve; that is, into standard scores. The assumption here
is that the trait being tested by each item is distributed “normally” (bell-shaped
curve) in the population being tested.

There is no formula for determining the exact distribution of item
difficulties. A common practice is to select some items whose difficulty
is at or close to the 50 percent level, and other items with a wide range
of degree of difficulty, in terms of percent passing. If all items selected
for inclusion in a test were at the 50 percent level of difficulty, the
test would, theoretically, simply divide the testees into two groups:
namely, those above this predetermined dividing point and those
below it. Such items would not differentiate among the individuals in the
group above the 50 percent level, nor among those below it. Hence,
for maximum differentiating efficiency, a test must contain items at
various levels of difficulty as represented by percentages passing.
The final consideration will be the inclusion of items of such a range
of difficulty as to yield the highest predictive value when compared
with the criteria, taking into account the levels of the ability or trait to
be measured and the degree of differentiation to be achieved.
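
The arithmetic behind the 50 percent rule is easily checked. A passed item separates each person who passes from each person who fails, so the number of discriminated pairs is the product of the two counts. The short sketch below (the function name is illustrative) reproduces the two figures given above and confirms that no other split of 100 testees yields more pairs.

```python
def discriminations(n_pass, n_total=100):
    """Number of pass/fail pairs an item can separate."""
    return n_pass * (n_total - n_pass)


few_pass = discriminations(10)    # 10 pass, 90 fail
half_pass = discriminations(50)   # 50 pass, 50 fail

# Search every possible number passing, 0 through 100, for the maximum.
best_split = max(range(101), key=discriminations)
print(few_pass, half_pass, best_split)
```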

A second method of analyzing validity of individual items is to cor- 
relate each item against the score of the subtest of which it is a part 
(e.g., information, arithmetical problems) to determine whether or 
not performance on it is consistent with performance on the subtest as 
a unit. This assumes, of course, that all items in the subtest are ex- 
pected to be homogeneous; that is, that they measure the same psy- 
chological process or combination of processes. 

Each item may be correlated, also, against the score of the whole 
test. In that case, the assumption is that all the items throughout the 
entire test are expected to be homogeneous in basic functions
measured. When an item is correlated against the subtest score, it is not
necessarily expected to show a significant correlation with the
whole-test score, because it may be the intention of the test’s author to
construct a scale whose subtest scores are relatively independent.
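
As a rough illustration of correlating an item against a subtest or total score, the point-biserial coefficient (named in the footnote on statistical method) may be computed as follows. The eight testees and their scores are hypothetical, and the function is a sketch rather than a prescribed procedure; in careful work the item's own point is often removed from the total before correlating, which is not done here.

```python
from math import sqrt

# Hypothetical data: 1 = item passed, 0 = item failed, with each testee's total score.
ITEM = [1, 1, 1, 0, 1, 0, 0, 0]
TOTALS = [38, 35, 33, 30, 29, 24, 22, 19]


def point_biserial(item, totals):
    """Point-biserial correlation between a pass/fail item and a score."""
    n = len(item)
    passed = [t for i, t in zip(item, totals) if i == 1]
    failed = [t for i, t in zip(item, totals) if i == 0]
    p = len(passed) / n          # proportion passing
    q = 1 - p                    # proportion failing
    mean_all = sum(totals) / n
    sd = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)  # SD of all scores
    mean_pass = sum(passed) / len(passed)
    mean_fail = sum(failed) / len(failed)
    return (mean_pass - mean_fail) / sd * sqrt(p * q)


r_item_total = point_biserial(ITEM, TOTALS)
print(round(r_item_total, 2))
```

A coefficient near zero would suggest that performance on the item is not consistent with performance on the subtest as a unit.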

The third technique is to analyze each item in respect to the per- 
formance of a low group and a high group; that is, low and high based 
upon scores on the test as a whole, or upon some external criteria 
whereby individuals can be classified. As already stated, a very few 
items should be within the ability range of all, or nearly all, testees. 
Others should be of increasing selectivity. Some items should, of
course, discriminate between two extreme groups, say, the highest and 
lowest 10 percent of the population tested; but it is desirable to have 
items whose selectivity extends beyond these narrow boundaries, 
items that would also dependably distinguish between, for example,
the highest one-fourth and the second-highest one-fourth, between the
lowest one-fourth and the next-lowest one-fourth. Kelley has offered
evidence indicating that most marked and significant discrimination
between extreme groups is obtained when item analysis is based upon
the highest 27 percent and the lowest 27 percent of the group. (The
ratio of the obtained difference to the standard error is a maximum.²⁰)
This method, however, provides only a crude item differentiation,
since it does not provide a basis for differentiating among the large
middle group of the population, namely, about 50 percent.

19 The statistical method used for this purpose is the biserial correlation or
the point biserial. See any standard textbook on statistics. For presentation of
the problems and methods in item analyses, see F. B. Davis, Item-Analysis
Data, Harvard Education Papers, No. 2, Graduate School of Education,
Harvard University, 1946. Also, J. A. Long and P. Sandiford, The Validation of
Test Items, University of Toronto Press, 1935.

Using this method, one procedure would be to find what percentage 
of the highest 27 percent and what percentage of the lowest 27 per- 
cent passed each item; then, by statistical calculation, to determine if 
the difference between the two percentages is significant. The same 
method can be followed with other proportions as well. In fact, items 
may be analyzed with regard to a wide variety of group classifications. 
Each item might, for example, be analyzed with reference to high,
average, low average, and low groups, classification being based upon
total test scores or upon external validating criteria.
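
The upper-and-lower-groups procedure can be sketched as follows. The hundred testees are hypothetical, and the significance check shown is the ordinary two-proportion z test found in standard statistics textbooks; the text does not specify which calculation is to be used, so this choice is an assumption.

```python
from math import sqrt

# Hypothetical group: total scores 1..100, with an item passed by all who score
# above 60 and by some of those scoring between 21 and 60.
TOTALS = list(range(1, 101))
ITEM = [1 if (t > 60 or (t > 20 and t % 3 == 0)) else 0 for t in TOTALS]


def upper_lower_analysis(item, totals, fraction=0.27):
    """Compare proportions passing in the highest and lowest score groups."""
    n = len(totals)
    k = max(1, round(n * fraction))                 # size of each extreme group
    order = sorted(range(n), key=lambda i: totals[i], reverse=True)
    upper = [item[i] for i in order[:k]]
    lower = [item[i] for i in order[-k:]]
    p_high = sum(upper) / k
    p_low = sum(lower) / k
    pooled = (sum(upper) + sum(lower)) / (2 * k)
    std_error = sqrt(pooled * (1 - pooled) * (2 / k))
    z = (p_high - p_low) / std_error if std_error > 0 else float("nan")
    return p_high, p_low, z


p_high, p_low, z = upper_lower_analysis(ITEM, TOTALS)
print(p_high, round(p_low, 3), round(z, 2))
```

Passing a different `fraction` (for example 0.25 for quarters) applies the same comparison to other proportions.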


Validating Objectives. The objective of all validating procedures is 
to make the most useful selection of test types and test items from 
among those available, such as to yield the highest prediction of the 
criterion or criteria. The first step in preparing such test items is in- 
sight into the psychological processes involved. The next prerequisite 
is that the items shall be well and precisely written. Then, basically the 
ultimate decision as to what are the criteria of validity in any area of 
testing rests upon the analytical judgments of and agreement among 
those specialists best qualified to evaluate objectives and behavior, 
who take into account the purposes for which and the groups for
whom the instrument is intended.


20 T. L. Kelley, “The Selection of Upper and Lower Groups for the
Validation of Items,” Journal of Educational Psychology, Vol. 30, 1939, pp. 17-24.

21 For a detailed technical treatment of measurement techniques, especially as
related to personnel problems, see R. L. Thorndike, Personnel Selection: Test
and Measurement Techniques, New York: Wiley, 1949.


INTERPRETATION OF TEST SCORES: 
QUANTITATIVE AND QUALITATIVE 


AN INDEX OF RELATIVE RANK 


The raw score (that is, the actual number of units or points) 
obtained by an individual on a test does not in itself have much, if any, 
significance. One test may yield a maximum score of 150, another 
200, and a third 300. Obviously, then, any point score on one of these 
tests is not directly comparable with the same number of points on 
either of the others; a score of 43 on one test cannot be directly compared
with a score of 43 on another. Furthermore, the average scores
of each of these will in all probability be different, as will the degree 
of variation of scores (called the deviation) both above and below the 
average. For example, the average (mean) score of the first test for 
a given age is, let us say, 90, with approximately the middle two thirds 
of the scores falling between 75 and 105. For the second test the mean 
is, say, 120, with the middle two thirds of the scores between 100 and 
140; while for the third test the mean is 180, with the range of the 
middle two thirds between 150 and 210. 

It is clear that if scores obtained on each of several tests are to be 
compared, indexes must be used which will express the relative signifi- 
cance of any given score; or what is known as relative rank. In the 
example given above, assuming that all three tests are intended for the 
same group, the mean scores of 90, 120, and 180 have the same rela- 
tive significance—that is, persons making these scores would be at the 
average in each. Similarly, scores of 75, 100, and 150 have the same
relative significance in their respective tests; for persons getting these
scores would be one standard deviation below the means (averages), 
which signifies that their scores surpass only about 16 percent of all 
the scores made by the population samplings upon whom the tests 
were standardized. 
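
The comparison just made can be verified numerically. The sketch below takes the means and the "middle two thirds" bands given for the three tests, treats half the width of each band as approximately one standard deviation (the assumption implicit in the text), and expresses each score in deviation units. The dictionary layout and function names are illustrative.

```python
# Mean and bounds of the middle two thirds of scores, as given for each test.
TESTS = {
    "first":  {"mean": 90,  "band": (75, 105)},
    "second": {"mean": 120, "band": (100, 140)},
    "third":  {"mean": 180, "band": (150, 210)},
}


def approx_sd(band):
    """Half the width of the middle-two-thirds band, taken as about one SD."""
    low, high = band
    return (high - low) / 2


def deviation_units(score, test):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - test["mean"]) / approx_sd(test["band"])


# Scores of 75, 100, and 150 on the first, second, and third tests respectively.
positions = {
    name: deviation_units(score, TESTS[name])
    for name, score in zip(TESTS, (75, 100, 150))
}
print(positions)
```

All three scores fall at the same position, one standard deviation below the mean, although the raw numbers differ widely.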

Innumerable other comparable points and scores could be selected 
for illustration. Obviously, however, such score-for-score comparisons 
would be extremely cumbersome and would, in each instance, have 
to be interpreted in terms of some common, meaningful index. Hence, 
to facilitate interpretation, psychological tests (with few exceptions) 
provide tables of age norms, or grade norms, or percentile ranks, or
decile ranks, or standard scores. These indexes are defined in the
following paragraphs.


Norms. A norm is the average or typical score on a particular test 
made by a specified population; for example, the average (mean)
intelligence test score for ten-year-olds, or twelve-year-olds; the mean
score for fifth-grade pupils on a test of arithmetic fundamentals.
Reference to a table of norms enables one to rank an individual’s
performance relative to his own or other age groups or grade groups.
Thus, for example, a child of ten may attain an intelligence test score 
that is average for his own age group, or for a population of nine- 
year-olds, or for those ten and a half, etc.; or a fourth-grade pupil, on 
a test of arithmetic fundamentals, may score at the level typical for 
his grade, or for some grade above or below. 


Mental Age. By means of tables of norms, it is possible to assign an
individual an “age” rating, on the basis of his performance on the
particular test being used. Thus, an individual, regardless of his age, who
gets an intelligence test score that is equal to the norm of the
ten-year-old population would have a “mental age” of ten as determined by
that test. If his score equaled the norm of the eleven-year-old
population, his “mental age” would be eleven. Hence we define mental age
(MA) as the level of a person’s mental ability as expressed in terms
of the chronological age of average persons having the same level of
mental ability.

Educational Age. If a pupil obtains a total score on a battery of
achievement tests (covering several school subjects) equal to the
norm of pupils who are twelve years old, he is said to have an
“educational age” (EA) of twelve. Educational age is defined as the age
equivalent of an individual's score on an achievement test as shown by 
age norms for the test in question, measuring achievement in a group 
of school subjects. 
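
Finding an age equivalent, whether a mental age or an educational age, amounts to looking up a score in a table of age norms. The sketch below uses an entirely hypothetical norms table, and the straight-line interpolation between adjacent ages is an assumption made for illustration; published tests supply their own conversion tables.

```python
# Hypothetical age norms: average raw score at each chronological age.
NORMS = {8: 40, 9: 47, 10: 54, 11: 60, 12: 65}


def age_equivalent(score, norms):
    """Age whose norm matches the score, interpolating between adjacent ages."""
    ages = sorted(norms)
    if score <= norms[ages[0]]:
        return float(ages[0])            # at or below the lowest tabled norm
    for younger, older in zip(ages, ages[1:]):
        if norms[younger] <= score <= norms[older]:
            span = norms[older] - norms[younger]
            return younger + (score - norms[younger]) / span * (older - younger)
    return float(ages[-1])               # at or above the highest tabled norm


at_norm = age_equivalent(54, NORMS)      # score equal to the ten-year norm
between = age_equivalent(57, NORMS)      # score between the ten- and eleven-year norms
print(at_norm, between)
```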

In a similar manner, age levels can be determined for individuals on 
any test whose purpose is to rank persons on the basis of performance 
according to age. There are times, however, when it is desirable to 
know the relative level at which a person is located, in respect only to 
a more narrowly specified group. The group, for example, might be 
of a particular age or grade, adults at large, or a particular class of 
persons such as college freshmen. For this purpose, the most
commonly used indexes are percentile rank, decile rank, and standard
scores.


Percentile Rank.¹ An individual’s percentile rank on a test designates
the percentage of cases or scores lying below it. Thus a person having
a percentile rank of 20 (P₂₀) is situated above 20 percent of the group
of which he is a member; or, otherwise stated, 20 percent of the group
fall below this person’s rank. A percentile rank of 70 (P₇₀) means
that 70 percent fall below, and so on for any percentile rank in the
scale. In effect, this statistical device makes it possible to determine at
which one-hundredth part of the distribution of scores or cases any
particular individual is located. By this means a person’s relative
status, or position in the hierarchy, can be established with respect to
the traits or functions being tested. And, as will be seen, psychological
measurement, unlike physical measurement, derives the greatest part
of its significance from relative ranks ascribed to individuals rather
than from units of measurement.


Decile Rank. The decile rank is the same in principle as the percen- 
tile rank; but instead of designating the one-hundredth part of a distri- 
bution, it designates the one-tenth part in which any tested person is 
placed by his score. The term decile technically means a dividing 
point. By “decile rank” we signify a range of scores between two
dividing points. Thus a testee who has a decile rank of 10 (D₁₀) is
located in the highest 10 percent of the group; one whose decile rank
is 9 (D₉) is in the second-highest 10 percent; one whose decile rank is
1 (D₁) is in the lowest 10 percent of the group.
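
Both indexes follow directly from counting the scores that lie below a given score. The following is a minimal sketch on a hypothetical group of 100 distinct scores; the function names are illustrative, and the treatment of tied scores (counted as not below) is one convention among several.

```python
def percentile_rank(score, scores):
    """Percent of scores in the group lying below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)


def decile_rank(score, scores):
    """Tenth of the distribution (1 = lowest, 10 = highest) containing the score."""
    return min(10, int(percentile_rank(score, scores) // 10) + 1)


GROUP = list(range(1, 101))  # hypothetical scores 1 through 100

pr_21 = percentile_rank(21, GROUP)    # 20 scores lie below 21
top = decile_rank(95, GROUP)          # within the highest 10 percent
second = decile_rank(85, GROUP)       # within the second-highest 10 percent
bottom = decile_rank(5, GROUP)        # within the lowest 10 percent
print(pr_21, top, second, bottom)
```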


1 Also called “centile.” 




Standard Score. This index (Z) is somewhat less obvious in its
meaning than percentile and decile ranks, although it, too, designates
the individual’s position with respect to the total range and
distribution of scores. The standard score indicates how far, in terms of
standard deviation, a particular score is removed from the mean of the
distribution. The mean is taken as the zero point, and standard scores
are given as plus or minus. If the distributions of scores of two or
more tests are approximately normal (“bell-shaped”), standard
scores derived from one distribution may be compared with those
derived from the others.

The formula is:

$$Z = \frac{X - M}{SD}$$

in which X is an individual score, M is the mean of the distribution
and SD its standard deviation.

Assume, for example, that the mean IQ of a group is 100 and that
the standard deviation is 14. In this distribution an individual
reaching an IQ of 114 has a Z score of +1.0. Another individual having
an IQ of 79 has a Z score of -1.5.

TABLE 8

Proportions of Cases, or Area Under the Curve,
Corresponding to Given Standard Scores

Z Score        Approximate Percent
               of Cases from the Mean

 .25                 10
 .50                 19
 .75                 27
1.00                 34
1.25                 39
1.75                 46
2.00                 48
3.00                 49.8

Standard scores must ultimately be given percentile values to
express their full significance. Since the number of cases encompassed
within a given number of standard deviations in a normal distribution
is mathematically fixed, it is always possible to translate a Z score into
a percentile value. Thus a person having a Z score of +1.0 has a
percentile rank of approximately 84: that is, his score surpasses 84
percent of the scores in the group. The person having a Z score of
-1.5 has a percentile rank of approximately 7, surpassing only 7
percent of the scores. Table 8 shows several standard scores
and the corresponding percentages of cases, for illustrative purposes.
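
The translation from standard score to percentile can be carried out with the normal-curve functions in Python's standard library. This sketch assumes, as the text does, that the scores are normally distributed; the function names are illustrative.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_score(x, mean, sd):
    """Standard score: distance from the mean in standard deviation units."""
    return (x - mean) / sd


def percentile_from_z(z):
    """Percent of cases falling below a given standard score."""
    return 100 * UNIT_NORMAL.cdf(z)


def percent_from_mean(z):
    """Percent of cases lying between the mean and a given standard score."""
    return 100 * (UNIT_NORMAL.cdf(abs(z)) - 0.5)


z_high = z_score(114, mean=100, sd=14)   # the IQ of 114 in the example
z_low = z_score(79, mean=100, sd=14)     # the IQ of 79 in the example
print(z_high, round(percentile_from_z(z_high)))
print(z_low, round(percentile_from_z(z_low)))
print([round(percent_from_mean(z)) for z in (0.25, 0.50, 0.75, 1.00, 1.25, 1.75, 2.00)])
```

The last line recomputes the whole-number entries of the area table from the normal curve itself.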

As an index of relative rank, the standard score is preferred by 
some psychologists because it is a well-defined property of the normal 
curve, representing a fixed and uniform number of units throughout 
the scale. Percentiles and deciles, on the other hand, are positions of 
rank in a group and do not represent equal units of individual
differences.


Quotients. The use of tables of norms for the determination of
the several kinds of test “ages” has already been mentioned.² Now,
in addition to these performance “ages,” it is a rather common
practice to determine a “quotient.” Of these, the most widely known is the
intelligence quotient (IQ), found by the simple formula:

$$IQ = \frac{MA}{CA}\,(100)$$

in which MA is the individual's mental age and CA is his
chronological age. Thus, it is clear that the IQ is the ratio of one’s mental age
to his life or chronological age (multiplied by 100 to remove the
decimal) and indicates rate of mental development or degree of
brightness. If mental development keeps pace with life age,³ the quotient is
100; if mental development lags or is accelerated, the quotient will be
less than or greater than 100, depending upon the degree of
retardation or acceleration.

2 There are a few tests for which mental ages are not derived from tables of
norms, notable among these being the Stanford-Binet Scale and the
Merrill-Palmer Scale. These tests and their scoring technique are presented in
subsequent chapters.

3 That is, until maximum capacity is reached.

Tests of educational achievement, as already stated, yield
educational ages (EA). This index may be divided by CA to give an
educational quotient (EQ); that is, an index showing, presumably,
whether a person’s knowledge and understanding of a group of school
subjects are commensurate with his life age, or whether above or
below what would be expected of him for his age.

Educational age may be divided also by MA instead of CA. If
that is done, the index is the accomplishment or achievement quotient
(AQ). The reason for using the mental age instead of the chronological
age in the denominator is that the former is regarded as the more
valid index of learning capacity. Hence, dividing the EA by the MA
yields a quotient which shows whether or not the individual is working
up to mental capacity as revealed by the intelligence test.⁴
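
The three quotients differ only in which "age" is divided by which. The sketch below uses hypothetical ages for one pupil; the multiplication by 100 is applied to the IQ, as in the formula given, while the EQ and AQ are left as plain ratios because the text does not state how they are scaled.

```python
def ratio(numerator_age, denominator_age):
    """Plain ratio of one age index to another."""
    return numerator_age / denominator_age


# Hypothetical pupil: ages in years.
MA = 12.0   # mental age
CA = 10.0   # chronological age
EA = 11.0   # educational age

iq = 100 * ratio(MA, CA)   # intelligence quotient
eq = ratio(EA, CA)         # educational quotient
aq = ratio(EA, MA)         # accomplishment (achievement) quotient
print(iq, eq, round(aq, 2))
```

In this illustration the pupil's achievement is above what is typical for his life age, yet below the level his mental age would lead one to expect.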

Subsequently, more will have to be said concerning these and other 
quotients. For the present, however, it is to be noted that quotients, 
like the other rating devices already presented, are in fact indexes 
whose significance is to be found to a considerable extent in the rela- 
tive status they give individuals. 

Let us assume that we are dealing with three boys, all of the same 
age. Suppose that their intelligence quotients are 50, 100, and 150.
Since these are numerical ratios (MA/CA × 100), it is natural to assume
that they have a quantitative significance. So they do, for they
indicate rate of mental development. But these quotients also have a
qualitative significance—for, among other things, they indicate each 
boy’s position in the “hierarchy of intelligence.” If the measure of in- 
telligence is valid, the boy having the IQ of 50 is seriously retarded 
and is in the lowest one percent of the population in respect to the 
psychological functions being tested; the boy with the IQ of 100 is 
the “typical” or “average” individual, midway up (or down) in the 
distribution of intelligence; and the boy having the IQ of 150 is very 
superior and belongs in the top percentile rank of the group. 
Qualitative significance of the intelligence quotient can be
illustrated further by asking this question: Is the brightest of these three
boys one and one-half times as intelligent as the “average” boy, and
three times as intelligent as the retarded one? The fact is that this
question cannot be answered in terms of numbers; it is impossible
actually to say how many “times” more capable or less capable one
is than the others, because the IQ is not a percent. But each of these
quotients has certain connotations. In this example, the qualified
school or clinical psychologist will be able to draw important
inferences from each boy's IQ regarding rate and quality of school
learning, extent and level of educability, vocational possibilities and levels,
and probable types of interests.

4 There are [illegible] problems concerning the AQ (such as the
logical fallacy of an AQ greater than 1.00) which will not be dealt with at
this stage, but will be treated in Chapter 14.

The boy with an IQ of 50 probably will not be able to complete
more than the second grade; the boy having the IQ of 100 should
be able to complete twelve grades; the boy with an IQ of 150 will be 
able to progress in education as far as his interests and motives indi- 
cate. Obviously, too, the kinds of occupations that will be open to the 
first boy are very limited: those open to the second will be numerous; 
those open to the third will be practically unrestricted, so far as the 
factor of mental capacity is concerned. And the same may be said of 
the range of interests in general that will be within the scope of each. 
These facts are of clinical significance, but at present there are no 
psychological or statistical means whereby one can calculate how 
many times more or less capable one person is as compared with an- 
other. 

A caution is necessary at this point. The inferences drawn in the 
preceding paragraph cannot be based solely upon the numerical IQ 
value without reference to the clinical features in the test performances 
or other factors not shown by the numerical index. We have assumed 
that there are no complicating factors and that the IQ’s are valid 
measures of the capacities and performances of the three boys. The 
boy with 150 IQ, however, might be an unstable personality who is 
failing in most or all of his school subjects. The boy with 100 IQ 
might have been penalized on the test by linguistic factors. And the 
boy with an IQ of 50 might show a “scatter” (inconsistency and
variation) of performance indicating emotional disturbance rather than
intellectual impoverishment. Occasionally, also, it will be found that
a high test rating may be attributable to an inconsistently high level
of performance on one or a few types of subtests (e.g., memory span,
word knowledge), just as, conversely, it occasionally happens that a
person’s IQ is depressed by an inconsistently low performance on one
or a few subtests.⁵


5 By “inconsistent” we mean that the individual’s levels of performance on
these few subtests differ markedly, in one direction or the other, from the
general and more uniform levels of his scores on the other subtests.


PSYCHOLOGICAL MEASUREMENT CONTRASTED WITH
PHYSICAL MEASUREMENT


The indexes thus far presented (ages, percentile rank, decile
rank, standard score, and quotients) are intended to emphasize the
principle that basically all psychological tests yield results that rank
individuals in relation to their fellows.




The raw scores obtained on psychological tests are not comparable
with or similar to the values obtained in the measurement of physical
traits or phenomena, as, for example, in measuring length, weight, or
light intensity. In the physical realm, the units of measurement are
fixed and constant throughout the entire scale. An inch, a pound, a
candle-power: each has the same value and physical significance at
whatever place on the scale it is measured. Psychological measurement,
by contrast, is more difficult and is confronted by special problems.
In the first place, it is not possible to determine the “inherent”
difficulty of an item in a psychological test in terms of constant units,
as it is possible to find the length or weight of any object. Whereas
in the measurement of physical phenomena it can be found that a
given object is of X length and Y weight, and hence, let us say, twice
as long and three times as heavy as another object with which it is
being compared, no such direct measurement and comparison are
possible in psychological testing. In this realm, the measurement value
of a test item is dependent basically upon the percentage of persons
able to pass the item in the population group for whom the test is
intended.
If in the testing of a particular ability, one item is passed by only
ten percent of a group, whereas another item is passed by fifty percent,
it cannot be said that the first is five times as difficult as the second,
because “percent passing” is not a unit in the sense that an inch or a
pound is. What can be said is that an individual able to deal successfully
with the first item belongs in the highest decile group, while one
who cannot pass the first but is able to pass the second falls at the
midpoint or “average” level of the group in respect to that item of the
test. This interpretation is significant psychologically and educationally.
Or, to use another instance, in the case of the Stanford-Binet
Scale the age level at which an item is placed (and hence its value
in the scale) is determined by the age group in which the “average”
individuals pass that item. Thus, anyone able to solve a reasoning
problem placed at the ten-year level, for example, but failing to pass
reasoning problems at the eleven-year or higher levels, may be said
to have typical ten-year-old ability in respect to that mental task.
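
The point about “percent passing” can be sketched in a few lines. The
pass/fail records below are invented for illustration and the helper
names are hypothetical; the sketch shows only that percent passing
orders items by difficulty and does not license ratio statements such
as “five times as difficult.”

```python
def percent_passing(results):
    """Percentage of examinees who passed an item (results are True/False)."""
    return 100.0 * sum(results) / len(results)


# Hypothetical pass/fail records for ten examinees on two items.
item_results = {
    "item_A": [True] + [False] * 9,       # passed by 10 percent
    "item_B": [True] * 5 + [False] * 5,   # passed by 50 percent
}

difficulty = {name: percent_passing(r) for name, r in item_results.items()}

# Rank items from hardest (fewest passing) to easiest. Only this ordering is
# meaningful; the ratio 50 / 10 is not a measure of "how many times harder."
hardest_first = sorted(difficulty, key=difficulty.get)

print(difficulty)
print(hardest_first)
```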

This leads directly into the problem of the meaning of “mental age”
and other “age” units. Mental age may be defined as the level of a
person’s mental development expressed in terms of the chronological
age of average individuals of the same level of mental development.
Or, otherwise stated, mental age is an index showing one’s level of
mental development, corresponding to the level of mental development
of average persons of the coinciding chronological age. Thus, if
a child’s mental age is ten, he has reached the level of mental development
attained by average children of ten years, regardless of the actual
life age of the child being tested.
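
The definition can be made concrete with a small sketch. It assumes a
table of average raw scores at each chronological age and reads a
child's mental age off that table by linear interpolation. The table,
the function name, and the interpolation rule are all assumptions for
illustration; published age scales use their own crediting rules.

```python
from bisect import bisect_right

# Hypothetical norms: average raw score earned by children at each age.
# The averages must increase with age for the lookup below to work.
AGE_NORMS = [(6, 20.0), (7, 26.0), (8, 31.0), (9, 35.0), (10, 38.0), (11, 40.0)]


def mental_age(raw_score, norms=AGE_NORMS):
    """Age whose average score equals raw_score, by linear interpolation."""
    ages = [a for a, _ in norms]
    means = [m for _, m in norms]
    if raw_score <= means[0]:
        return float(ages[0])      # at or below the lowest normed level
    if raw_score >= means[-1]:
        return float(ages[-1])     # at or above the highest normed level
    i = bisect_right(means, raw_score)
    lo_age, lo_mean = ages[i - 1], means[i - 1]
    hi_age, hi_mean = ages[i], means[i]
    return lo_age + (raw_score - lo_mean) / (hi_mean - lo_mean) * (hi_age - lo_age)


print(mental_age(38.0))   # score equal to the ten-year average
print(mental_age(33.0))   # score halfway between the eight- and nine-year averages
```

Note that the child's own chronological age never enters the
calculation: a score equal to the ten-year average yields a mental age
of ten whether the child tested is eight or twelve.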

Now, suppose there are four individuals having mental ages,
respectively, of five, six, twelve, and thirteen. Is the difference between
mental ages five and six the same as that between mental ages twelve
and thirteen? It is not; for, as measured by mental tests, the rate of
mental development at five and six years of age is more rapid than
subsequently; hence, the increment between the earlier years is
greater. Figure 2.1 is the curve accepted by most psychologists as
representing rate of mental development. The outstanding feature of that
curve, for the problem under consideration, is the fact that it rises at
a decreasing rate with increasing age. The curve is “negatively
accelerated.” Or, psychologically speaking, with each succeeding year,
until maximum development is attained, the amount of increment in
mental growth and development is less than in the preceding year.

FIG. 2.1. Hypothetical Curve of Mental Growth, Illustrating Decreasing
Yearly Increments. (Vertical axis: Mental Age; horizontal axis:
Chronological Age.)

It is obvious, therefore, that each successive year of mental age
added to an individual’s level represents something less in measurable
growth and development than the preceding year’s increment. In other
words, mental age units are not uniform; they simply rank an individual
with respect to the average mental capacity of an age group.
The same principle, nonuniformity of age units, applies also to all
other types of psychological tests which translate their scores into age
equivalents.
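
A numerical sketch may help to show what “negatively accelerated”
implies for yearly increments. The text gives only the shape of the
curve, so the exponential form, the ceiling, and the rate constant
below are assumptions chosen purely for illustration.

```python
import math

CEILING = 16.0   # hypothetical upper limit of the curve, in arbitrary growth units
RATE = 0.15      # hypothetical rate constant


def growth_level(age):
    """Hypothetical negatively accelerated growth curve: rises, but ever more slowly."""
    return CEILING * (1.0 - math.exp(-RATE * age))


def yearly_increment(age):
    """Growth added between `age` and `age + 1`."""
    return growth_level(age + 1) - growth_level(age)


early = yearly_increment(5)    # growth from age five to six
late = yearly_increment(12)    # growth from age twelve to thirteen

print(round(early, 3), round(late, 3))
print(early > late)
```

Under any curve of this general shape the increment from five to six
exceeds that from twelve to thirteen, which is why one “year” of mental
age is not a constant unit.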

Although psychological testing would be facilitated and would be 
given a higher degree of precision if its measuring units were fixed 
and uniform, the fact is, nevertheless, that the available indexes of 
relative rank are essential educationally and clinically in evaluating 
the status of an individual’s mental development, his educational prog- 
ress, his particular aptitudes, his social maturity, and even certain non- 
intellective aspects of personality. Experimentally, too, these same 
indexes have been indispensable in studying a host of practical and 
theoretical problems such as sex differences, effects of environmental 
conditions, inheritance of intellectual capacity and of special aptitudes,
occupational differentiation, racial differences, relationships between
physical and mental development, and many others.


CLINICAL ASPECTS 
Scores, whether raw or converted, do not suffice for the com- 
plete interpretation of an individual’s performances on psychological 
tests. The several aspects of test standardization thus far presented are 
concerned with the performance of groups of persons and with aver- 
age relationships revealed by statistical treatment of results. It hap- 
pens, however, that while certain types of test items meet some or most 
of the statistical requirements of validity, they are unsatisfactory as 
indicators of intelligence when used for clinical purposes. For ex- 
ample, on the Stanford-Binet scale, the percentage of adults able to 
repeat eight digits forward (digit-span test) is approximately the same 
as the percentage who can solve one of the more difficult reasoning 
problems. Yet, in clinical examinations psychologists find some adult 
mental defectives who can pass the former test, though it never hap- 
pens that a true mental defective can succeed with the latter. What this 
means, in effect, is that statistical validation of a test item is not always 
sufficient; it must be supplemented by the pragmatic criterion of use 
with a wide variety of individuals in a variety of situations to show 
whether or not it has discriminative value as between individuals at 
the several levels of ability. 
Psychological tests, as already noted, are standardized on the basis
of the performance of a representative population; and an individual’s
rating is determined by the relationship of his performance to that of
a group as a whole. Thus we have the several “ages” (e.g., mental 
age) and “quotients” (e.g., intelligence quotient), percentile and 
decile ranks, and standard scores. Any useful test should yield one or 
more of these ratings. In more recent years, however, without denying 
the usefulness and value of these indexes of relative status, increasing 
emphasis has been placed upon “patterns” of performance as clinical 
aids to psychological diagnosis and counseling. 

A person’s responses to tests are now frequently analyzed for the
purpose of discovering whether he shows any special abilities or
disabilities, whether there are marked discrepancies between responses
on some types of materials as against responses on others, whether
certain psychological processes seem to be impaired or are markedly
superior to others within the individual. A general contrast, for
example, might be found between tests involving verbal materials and
those which are nonverbal in character; the associative processes
might be disturbed; memory or spatial perception might be found
to deviate markedly in one direction or another from an individual’s
general level of capacity. Recent investigations have indicated that
“patterns” of response may be useful in differentiating and diagnosing
the several categories of maladjusted and abnormal personalities, as
well as for discerning more clearly the mental defectives.
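
The kind of discrepancy described above can be sketched as a simple
profile check: each subtest score is compared with the individual's own
average, and marked deviations are flagged. The subtest names, the
scores, and the threshold of three points are hypothetical; the sketch
is an aid to inspection and is not a diagnostic rule.

```python
from statistics import mean


def scatter_flags(subtest_scores, threshold=3.0):
    """Return subtests deviating from the person's own mean by more than threshold."""
    own_level = mean(subtest_scores.values())
    return {
        name: round(score - own_level, 2)
        for name, score in subtest_scores.items()
        if abs(score - own_level) > threshold
    }


# Two hypothetical examinees with the same total score but different profiles.
even_profile = {"vocabulary": 10, "digit_span": 10, "reasoning": 10,
                "block_design": 10, "object_assembly": 10}
uneven_profile = {"vocabulary": 15, "digit_span": 14, "reasoning": 9,
                  "block_design": 7, "object_assembly": 5}

print(sum(even_profile.values()), sum(uneven_profile.values()))
print(scatter_flags(even_profile))
print(scatter_flags(uneven_profile))
```

Both examinees earn the same total, yet only the second shows subtests
that depart markedly from his own general level. A single index would
conceal that difference.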

Also, it has been found that persons of equivalent general mental
status may have different “patterns” of performance, or abilities,
which in sum, nevertheless, give them much the same over-all and
general ratings in terms of a single index (e.g., mental age, percentile
rank). That is to say, it is possible for two persons to have test ratings
which are numerically similar but whose “mental organizations” are
dissimilar, since the components of each total rating differ to a greater
or lesser degree from those of the other.

If, therefore, the psychologist’s concern is not primarily with group
trends or averages but rather with a particular individual, he will, to
be sure, want to know the age level of performance and the consequent
“quotient”; but he will also analyze the details of the individual’s
performance for the purpose of discovering that person’s particular
pattern or idiom, in order to discern his particular form of mental
organization, specific evidences of retardation, or disability, if any,
and details of his development.

In more recent years there has been a partial shift in emphasis from
almost exclusive concern with the analysis of abilities and methods
of psychological measurement, as such, to an examination of individual
performance and individual idiom, and to the individual as a
functioning and dynamic unit. After all, any given test measures only 
a segment of a total personality; that segment is an integral part of 
the totality and is influenced by the whole. Hence, the psychologist 
who is concerned with insight into the nature of an individual’s abili- 
ties must be able to evaluate a person’s performance as well as meas- 
ure it. The data and indexes derived from psychological tests are, for 
the most part, objectively determined; but their clinical use involves 
judgment, subjective assessment and interpretation, based upon a 
variety of data from several sources. The experienced clinical exam- 
iner will supplement the test’s numerical results with his observations 
of the testee’s attitudes during the examination and the manner in 
which he attacks the problems of the test: his degree of confidence or 
dependence, his cooperativeness or apathy, his negativism or resent- 
ment, the richness or paucity of his responses. The individual test 
situation can thus be, in effect, an occasion for general psychological 
observations—really a penetrating psychological interview. 

Ability not only to score a test but to assess and interpret responses 
and to evaluate the individual’s behavior during the examination is a 
clinical art that the psychologist develops from working with persons 
rather than with tests alone, though for the practice of his art he must, 
of course, thoroughly understand the psychological and statistical 
foundations and hypotheses upon which the tests are based. 

A few specific instances of the qualitative analysis and interpretation
of test responses will illustrate the kinds of observations that
constitute the clinical aspects that supplement numerical scoring.

Word definitions are generally acceptable at a fairly elementary
level; but they vary in level and quality from purely concrete, to
functional, to conceptual or abstract. Differences in quality level are
indicative of differences in modes of thinking. It happens at times,
also, that some words are emotionally charged for the examinee, in
which case his definition and behavioral response may be revealing.

Some test items permit the exercise of considerable freedom in
response. These responses may reveal the examinee’s attitudes, values,
and modes of meeting life situations. In this category are test items
that ask, “What is the thing to do when . . . ?” Or, “Why should
we . . . ?” The subject’s reactions to such items, the qualities of his
verbalizations in making the responses, and the presence or absence
of some feelings reveal some of the nonintellective aspects of his
personality.

The subject’s specific comments while performing a task are of
possible significance in regard to his attitude toward himself, or toward
an authority figure (the examiner), or toward other individuals
and institutions in his environment.

Responses to items or random comments may reveal hostilities
and anxieties, or wholesome cooperativeness and security.

The manner of speech, the use of expletives, halting and stumbling,
restless movements, blushing, or, on the other hand, ease of speech,
a relaxed attitude, mild criticism of one’s own performance provide
valuable clues to the testee’s personality.

Character disorders may be indicated by impetuous and uncritical
responses that are [illegible] but are given with assurance and
presumptuousness.

The subject’s ability to direct his attention toward, to concentrate
upon, and to organize a task are often revealed by his mode of approach
to a test problem.

The selective character, if any, of a person’s vocabulary and information
(two subtests widely used) will shed light upon his experiences,
interests, cultural background.

A personality trait such as compulsiveness (as opposed to desirable
thoroughness and self-criticism) may be revealed by excessively detailed
responses and by numerous and unnecessary alternative responses.

Some types of responses indicate pathological or psychotic states;
for example, erroneously bizarre responses by an otherwise intelligent
person (e.g., London is in Africa, the population of the United
States is 15,000,000), by disjointed and irrelevant responses, and by
distorted interpretations of the task or problem.

Detection of organic damage through selected kinds of subtests;
e.g., disturbance of the visual-motor function as indicated by the
diamond copying test (Stanford-Binet) and the object assembly test
(Bellevue), among others.

[illegible] analysis (discussed in detail in Chapter 15) is essential to
the discernment of superior, inferior, and impaired psychological
functions.

Sensitized observations on the part of the examiner will enable
him, in general, to evaluate how the subject proceeded in both success
and failure.*

* Illustrations of these qualitative interpretations will be given in
Chapter [illegible], on clinical aspects.

The findings on a test (whether it is of general intelligence, specific
aptitude, personality, or school learning) indicate the present status
of each person examined. They do not, however, tell the psychological
examiner by what course the person arrived there; nor do they indicate
specifically what factors were operative in his development. The clinical
approach, while accepting and utilizing standardized tests and
norms, insists upon viewing and evaluating any given individual’s
performance and status in the light of a variety of other measurements,
observations, and activities of that individual, and upon interpreting
the objective quantitative data according to the part they have in the
total. For instance, children who have suffered from prolonged and
serious nutritional disturbances or deficiencies or who are suffering
from severe anemia will appear listless, apathetic, and deficient in
mental capacity according to standardized tests. Other children, of
apparently retarded mental development, may be suffering from serious
deficiencies of vitamin B complex, while still others may measure at
a level of retardation because of emotional pressures and “blocking.”†
Furthermore, the performance of some children on standardized
tests is of inferior quantity and quality because they developed
under conditions of psychological impoverishment, whereas test norms
are based on the assumption that all individuals being examined have
had approximately equal opportunity, in the grosser aspects of
environment, for mental development. Often, of course, this is not the
case. These facts, and others of the same kind, indicate that in the
case of some individuals, performance and consequent relative status
may be impaired by nutritional deficiencies, by emotional handicaps,
or by other unfavorable environmental conditions.

† A condition in which the functioning of mental abilities is impeded
due to the individual’s emotional state or mental conflict.

The fact that psychologists are tending to put increasing emphasis
upon the significance of the individual pattern in test performance
(especially in diagnosing cases of behavior and educational
maladjustment) and upon the individual as a whole does not mean that
statistical and group studies are unnecessary. Such studies are essential in
providing norms against which any individual’s performance may be
projected in the process of evaluation, in giving more precise meaning
and significance to any single score, in demonstrating the great range
of human variability in any trait or function, and in providing the
means of more precise study of interrelationships among psychological
traits and functions.


DIFFERENCE BETWEEN NORMS AND STANDARDS 


Norms, as already explained, are average scores or values 
determined by actual measurement of a group of persons who are 
representative of a specified population; for example, all twelve-year- 
old boys, all fourth-grade children, all native-born male adults. Norms, 
therefore, are averages obtained under prevailing conditions—good, 
bad, or indifferent. These norms may well “. . . reflect all the sins of 
omission or commission in their [the people’s] nurture and must be 
critically examined lest we set up as desirable norms for achievement 
what are but accidental outcomes of our unsystematic and unenlight- 
ened nurture of children . . .”* In other words, a norm of psycho- 
logical performance or of a physical trait is not necessarily one with 
which we should be satisfied; for it reflects development under conditions
that may be and often are much less than optimal. As an example,
consider the “average vocabulary” of the eighth-grade pupil. The
average, or norm, of this group will be dependent in part upon their
opportunities from earliest childhood for the acquisition and use of
language. Their opportunities might have been extremely poor, mod- 
erately satisfactory, very good, or at any other level between these 
three. Norms of performance in respect to some of the psychological 
processes measured by means of tests of intelligence and of specific 
aptitude are likewise dependent upon conditions and opportunities 
present during the course of development. Norms of height, weight, 
and other body measurements will also reflect past conditions of nu- 
trition and health. 

It is necessary, therefore, to distinguish between norms, on the one
hand, and standards, on the other; for a standard is the desired goal
or objective, which may well be above the obtained norm and can be
achieved only under improved conditions of development. It is possible
that the grade-norms for reading rate and comprehension are below
what they could be under improved educational conditions and
teaching methods; that age and grade norms for numerical ability are
below what they might be; that universal nursery school and kindergarten
experience would promote children’s perceptions of form and
color and would improve their motor skills, and hence raise the norms;
that universal optimal nutrition would raise age norms for height and
weight, etc.

* L. K. Frank, “Research in Child Psychology: History and Prospect,” in
Child Behavior and Development, R. G. Barker et al., editors, New York:
McGraw-Hill Book Co., 1943, p. 9.
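
The distinction can be put in computational terms: a norm is calculated
from what a group actually does, whereas a standard is a goal supplied
from outside the data. In the sketch below the reading-rate scores and
the target value are both invented for illustration.

```python
from statistics import mean

# Hypothetical reading-rate scores (words per minute) from a fourth-grade sample.
fourth_grade_sample = [95, 110, 120, 125, 130, 145, 150]

norm = mean(fourth_grade_sample)   # what the group actually achieves
standard = 140                     # hypothetical goal set by judgment, not by the data

print(norm)
print(standard - norm)
```

Testing a larger or better-chosen sample refines the norm but leaves
the standard untouched; the norm rises only if the conditions of
development that produce the scores improve.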

Psychological tests measure traits and functions as they exist under 
present conditions. They do not provide the psychologist and educator 
with an index of what ought to be, except by implication and insofar 
as obtained results might raise certain suspicions, doubts, and queries
in the minds of investigators.


FACTORS IN SELECTING A TEST 


By way of summary, the following factors are given as those to
be considered in selecting a psychological test.


Norms. The test must provide appropriate and accurate norms,
whether they be in the form of age, grade, percentile rank, standard
score, or any other type. Norms should be meaningful with regard to
the purposes for which the test is intended and to the groups of persons
with whom it is to be used.


Administering and Scoring. The procedures of administering the test 
should be objective and the test items should be amenable to relatively 
objective and simple scoring, insofar as the nature of the instrument 
permits. Individually administered tests, like the Stanford-Binet, at 
times require insightful judgment in scoring responses. Interpretation 
and evaluation of responses are even more significant in the scoring
and analysis of projective techniques for assessing personality.


Time Requirements. The length of the test should not be so great as
to produce boredom, satiation, or negativism; for when these set in,
the subject does not perform at his best level. Specific time limits cannot
be prescribed for all tests or for all types of testees; but in general,
shorter time requirements are indicated for younger children and for
the mentally retarded. In the case of both of these groups of subjects,
the attention span is relatively brief; hence, it may be necessary at
times to complete an examination in two sessions.


Interest Level. Test items should be of sufficient interest to motivate
the individuals for whom they are intended. Particular items and types
of problems devised to measure given functions must be suitable to the
age levels of examinees. Thus, in constructing an intelligence test for
the entire range of adult capacity, from very low to very high, it is
necessary that the items placed even at the very low levels should be
of the sort that will interest an adult rather than a child, even though
these low-level adults may be inferior to some children so far as mental
capacity is concerned.


The Population Sample. The manual of a test should state in detail 
the nature of the population sample on which the instrument was 
standardized and upon which norms are based. The information given 
should include the following: total number of cases, age range and 
number at each age level, number of each sex, geographic distribution, 
socio-economic status and number in each category. For some tests, 
it will be relevant, indeed necessary, to have information on some of 
the following: school-grade distribution, number of years of schooling 
completed, amount and kind of special training (especially for tests of 
specific aptitudes), special or “abnormal” adjustment problems and 
history (especially for tests of personality). In short, the prospective 
user of a test must be certain that the test has been standardized on an 
appropriate sample of the population and for the same or similar 
purposes as those contemplated by the prospective user. This principle
seems axiomatic; yet it is not always given due consideration. 


The Functions or Traits Measured. The test manual should not only
state the purpose of the instrument; it should also provide, so far as
possible, an analysis (psychological and statistical) of the functions
or traits being measured.


Reliabilities. Coefficients of reliability should be provided not only
for total scores but for part scores as well, wherever possible. Also,
reliability coefficients are desirable for each of the several age-levels 
ge of the test. Furthermore, 
ethods have been used in 
i, Prospective user of a test 


him answer the question: 
“Reliable for whom and for what purposes?”


Validity. Data on validity are of several kinds in addition to
coefficients of correlation; e.g., expectancy tables, known groups,
significance of differences between age levels, etc. The test manual should
explain the characteristics of the criterion groups, the nature of other
criteria used, the validity of the total test, and the validity of the
subtests. It is desirable, also, to have data regarding validity at each of the
several age and ability levels. Here again, an answer must be sought
to the question: “Valid for whom and for what purposes?”


Reports of Experiments. An ideal to be aspired to is to have test 
manuals (subsequent to the first or earliest editions) include sum- 
maries, findings, and interpretations of the most important experi- 
mental studies to which the test has been subjected by various 
psychologists. Such information will help users to understand more
fully the nature of the test and the factors affecting performance on it, 
thus making for sounder interpretation of results obtained by those 
who use it. For example: What is the influence of cultural factors? 
Of practice? Of time limits? Of psychotherapy or counseling? 


Psychological tests are scientifically constructed instruments, based 
upon psychological and statistical principles, as explained in the pre- 
ceding pages. Familiarity with these principles should provide students 
with a sounder comprehension of both the values and the limitations 
of tests than they would obtain from using and interpreting these in- 
struments in a mechanical manner. It is also true that when we test 
human subjects we are dealing with individuals who do not behave 
like mechanisms under complete control, with all environmental forces 
likewise under control and measurable. On the contrary, human be- 
havior is often subtle and the psychological forces motivating or in- 
fluencing persons in a test situation may be elusive and difficult of 
evaluation. Furthermore, as has been indicated, since the quantitative 
data of psychological testing are not as definite, precise, and uniform 
as are the data of physical measurements, the interpretation of test 
findings is the more difficult. For these reasons, we have emphasized 
not only the well-defined scientific principles and procedures of testing,
but also the qualitative and clinical aspects which are essential if test
findings are to be of the greatest value to the individuals examined.


3. 




DEFINITIONS AND ANALYSES OF 
INTELLIGENCE 


DEFINITIONS OF INTELLIGENCE 


If intelligence is to be measured and assessed, it is necessary to 
define it, at least tentatively. A variety of definitions have been given 
by psychologists; but, as a matter of fact, each of them can be classi- 
fied into one of several groups. 

One group of definitions places the emphasis upon adjustment or 
adaptation of the individual to his total environment, or to limited 
aspects thereof. According to definitions of this type, intelligence is 
general mental adaptability to new problems and new situations of 
life; or, otherwise stated, it is the capacity to reorganize one’s behavior 
patterns so as to act more effectively and more appropriately in novel 
situations. Thus, the more intelligent person would be one who can 
more easily and more extensively vary his behavior as changing con- 
ditions demand; he has numerous possible responses and is capable of 
greater creative reorganization of behavior, whereas the less intelligent 
person has fewer responses and is less creative. The more intelligent 
person, accordingly, can deal with a greater number and a greater 
variety of situations than the less intelligent; he is able to encompass 
a wider field and to expand his area of activity beyond that of the less 
intelligent. 

A second type of definition states that intelligence is the ability to 
learn. According to this definition, then, a person’s intelligence is a 
matter of the extent to which he is educable, in the broadest sense. The
more intelligent the individual is, the more readily and extensively is
he able to learn; hence, also, the greater is his possible range of experi- 
ence and activity. 

Still others have defined intelligence as the ability to carry on ab- 
stract thinking. This means the effective use of concepts and symbols 
in dealing with situations, especially those presenting a problem to be 
solved through the use of verbal and numerical symbols. Binet’s con- 
ception of intelligence belongs largely in this category, for he main- 
tained that it is the capacity to reason well, to judge well, and to be 
self-critical. 

It should be apparent that the three foregoing categories of defi- 
nitions are not and cannot be mutually exclusive. For the most part, 
their authors differ in emphasis. Obviously, ability to learn must pro- 
vide the foundation for adjustment and adaptation to changing or new 
conditions. And a person may be expected to have learned more or 
less from situations he had encountered and to which he had made 
adjustments previously. For if this were not the case, he would have 
to start anew in every situation which confronted him; there would be 
no difference between the behavior of an experienced person and that 
of a novice. 

There are, of course, individual differences in respect to learning 
capacity and in ability to retain, interpret, organize and apply what has 
been learned; thus previous experiences will have different significance 
and different learning-value for different persons. And it is learning 
capacity that constitutes the basis of adjustment and adaptation, al- 
though, as will become apparent in later chapters, important nonintel- 
lective factors affect adjustment and adaptation. 

Yet learning capacity, in the sense only of acquisition of informa- 
tion and knowledge, is not a sufficient criterion by which to evaluate 
a person’s intelligence. Psychologists and laymen alike are agreed that 
a person who can reorganize and apply what he has acquired for the 
purpose of dealing with varied and novel situations is more intelligent
than one who is capable of little beyond repeating what he had previ- 
ously acquired, or than one whose behavior follows stereotyped pat- 
terns without insight into the essential elements and relations of each 
new situation. Thus a definition of intelligence as the capacity to be- 
have appropriately and effectively in new situations and a definition 
of intelligence as the ability to learn are in fact two aspects of the same
process.


The third type of definition is also inseparable from the other 
two. A person learns abstractions—principally verbal and numerical 
—through experience, through contact with and perception of the 
objects, events, qualities, relationships, etc., for which the symbols 
stand. Thus, the word “dog” has meaning for a child because it has 
come to represent a class of objects with which he has become fa- 
miliar. The word “green” represents a quality he has perceived as an 
aspect of a variety of objects. The word “charity,” for the individual 
who has developed sufficiently to understand the concept, has a cer- 
tain connotation by virtue of the fact that he has experienced events 
which have been labeled as charitable. The number “five” is meaning- 
ful to a person when, as a result of experience with concrete objects, 
he apprehends the term as representing not only ordinal position but 
summation as well. Furthermore, if it is to be said that an individual 
has fully learned to deal with the symbols of abstraction, then it must 
be true that he understands that the word is not the thing or the quality 
for which it stands. He understands that words and numbers are
abstractions which represent objects, events, qualities, relations, etc., but
which, in thinking, can be dealt with as if they were the things themselves.
This aspect of intelligence (the ability to use symbols) is itself
the result of an individual’s development and learning. And in its turn,
the mastery and utilization of symbols promotes further learning, for
it is hardly necessary to labor the point that without language and
number, the range of one’s learning would be seriously restricted.

Ability to carry on abstract thinking, it is easy to see, contributes
to a person’s ability to adjust or adapt to changing or new situations,
because through the use of symbols we are enabled to think through
a problem without spending time and effort on sheer trial and error in
action; we are enabled to marshal, evaluate, and deal with past
experiences; and we are enabled to project our thinking forward. In
other words, through the use of symbols and abstract thinking, man
is able considerably to enlarge his range of behaving and adjusting,
to extend his horizons, and to transcend the immediate concrete and
specific situation.

TWO COMPREHENSIVE DEFINITIONS 


In more recent years, two definitions of intelligence have appeared
which, in effect, combine and extend the three types of definitions
already presented. One writer states:¹ “Intelligence is the
aggregate or global capacity of the individual to act purposefully, to 
think rationally and to deal effectively with his environment.” The 
reader can readily compare this definition with those already presented 
and analyze it with a view to discerning similarities and differences. It 
will be noted, of course, that this definition encompasses the other 
three. Although it does not specifically mention learning ability, yet 
that is surely implied. Two new aspects, however, are added. The 
definition specifically states that an individual's intelligence is revealed 
by his behavior as a whole (“global”), and that intelligence involves 
behavior toward a goal, which may be more or less immediate (“pur- 
posefully”). A third aspect is presented by the author in his elabora- 
tion of the definition; namely that “drive” and “incentive” enter into 
intelligent behavior. This aspect is probably included and implied in 
capacity “to act purposefully” and “to deal effectively” with one’s 
environment, as stated in the definition. 

The inclusion of “drive,” “incentive,” and the like as aspects of
intelligence is of very doubtful validity; for to do so is to confuse the
issue, the testing instrument, and the results obtained. It is true, of
course, that effective utilization of a person’s intelligence depends
upon the extent and degree to which he employs it. Nevertheless, a
single testing device which attempts to combine the measurement of
intellectual with nonintellectual traits, without providing for differentiation
between the two, would not succeed adequately in either respect.²
This is not to say that in assessing an individual’s intelligence
and personality as a “whole” we should ignore “drive,” “incentive,”
“interest,” etc.; for the competent psychological examiner will and
must evaluate these and other nonintellectual traits in presenting his
test results. Furthermore, as will be seen in later chapters of this book,
special psychological instruments are available for the evaluation of
nonintellectual traits of personality which the clinician may use to
supplement results of intelligence tests if he believes they are necessary.

1 D. Wechsler, The Measurement of Adult Intelligence, Baltimore: Williams
and Wilkins, 1944, p. 3. Actually, the usefulness of Wechsler’s Bellevue scale is
not dependent upon his conception of intelligence. This scale does not, in fact,
incorporate all aspects of the definition.

2 The Rorschach test attempts to measure both nonintellectual and intellectual
traits of personality. This test is discussed in Chapter 19.




Stoddard³ offers the following definition: “Intelligence is the ability
to undertake activities that are characterized by (1) difficulty, (2)
complexity, (3) abstractness, (4) economy, (5) adaptiveness to a
goal, (6) social value, and (7) the emergence of originals, and to
maintain such activities under conditions that demand a concentration 
of energy and a resistance to emotional forces.” Here again, the reader 
will note that this definition does in fact include the first three types 
of definitions presented; but it goes beyond these in several respects. 
The author specifies the several attributes of intelligence, and in his 
enumeration are several not included in earlier definitions. 

Degree or level of “difficulty” is implied in all definitions; but Stod- 
dard’s contribution here lies in the fact that he rightly insists we must, 
in testing, distinguish between true differences in degree of difficulty 
and differences that only seem to exist, as between two or more test 
items, whereas, in fact, there are no inherent differences in difficulty.
For example, the accumulation of pieces of rare information and the 
ability to define unusual words are not in themselves true measures of 
difficulty; they may reflect only differences in experience. On the other
hand, however, over and above disparity in experiences between vari- 
ous age groups, true differences in difficulty do exist between problems 
that can be solved, let us say, by a group of “average” ten-year-old 
children, and those which can be solved by an “average” group of 
eight-year-olds or nine-year-olds. 

“Complexity” refers to the number of different kinds and varieties 
of tasks that can be successfully dealt with. According to this attribute
of intelligence, the individual who is able to deal successfully with 
several different kinds of tasks, at a given level of difficulty, is more 
intelligent than a person who can successfully undertake fewer kinds 
of tasks at the same level of difficulty. “Complexity,” however, means 
not simply the addition of one type of performance to others: on the 
contrary, it means the capacity to assimilate new abilities, to integrate
them with others, and thus to reorganize one’s patterns or forms of
intelligent behavior.


“Abstractness,” that is, operating with symbols, especially at levels
of analysis and interpretation, has already been discussed. For Stoddard,
this attribute “lies at the heart of intelligence as defined.”

“Economy” refers to the rate at which mental tasks are performed
and problems solved. Assuming that the problems are solved equally
well, that the solutions are equally effective, the individual working
more rapidly would be regarded as the more able, according to this
attribute. Acceptance of “economy” as an attribute of intelligence
means that tests would impose time limits which should differentiate
among individuals in respect to their rates of performance of tasks
and solutions of problems at given levels of difficulty and degrees of
complexity.

3 G. D. Stoddard, The Meaning of Intelligence, New York: Macmillan,
1943, p. 4.

“Adaptiveness to a goal” implies an approach that is more than 
aimlessly meeting and solving new situations as they arise. This at- 
tribute means that intelligent action is directed toward a goal or a 
purpose. The more comprehensive the goal, the larger and more 
complete the purpose, the more is intelligent action required. 

The student, after examining representative tests of intelligence, 
might well question whether they do, or even could, satisfactorily test 
this last attribute; or whether the problems and tasks included in the 
tests are rather oversimplified and segmental examples of problems 
and courses of action that a person has to confront and deal with in 
actual life situations. If the test items are of the latter kind, then their 
value and validity as measures of intelligence must be shown by the 
fact that they do indeed predict to an adequate degree the manner and 
effectiveness with which the testee will deal with and solve actual life 
situations of broader scope. In other words, what are the predictive 
values of the items, tasks, and problems included in a test? 

The inclusion of “social value” as an attribute of intelligence is of 
doubtful validity, and debatable at best; for this criterion is essentially 
moral or ethical, or a matter of subjective evaluation. The basis of 
“social value” is group acceptability. If this attribute were applied in 
evaluating intelligence, we should have to minimize our estimates of
the intelligence of individuals whose thinking and solutions of prob- 
lems are not necessarily consistent with accepted social forms, though 
they might be “ahead of their time”; and of individuals who are capa- 
ble of difficult and complex mental operations, but whose mental ac- 
tivities lead to no apparent or demonstrable practical and social 
values. While we may well value more highly the individual whose 
mental operations culminate in desirable, acceptable, and useful social 
outcomes, the inclusion of “social value” as an attribute of intelligence 
would invalidate attempts to measure the other, and valid, attributes by
injecting largely subjective conceptions of what is socially acceptable
or unacceptable or indifferent. It will be seen later that “social value”
is hardly present in current tests of intelligence; although, of course,
some psychologists, like Stoddard, take the position that it should be.

“The emergence of originals” as an attribute of intelligence is the
ability to create something new and different; it is a characteristic of 
a high order of thinking and of individuals at the superior end of the 
distribution of intelligence. Examples of this attribute in operation are 
the development of a new scientific principle, the discovery of unique 
relationships in observed data or phenomena, the development of a 
new machine-design, the development of a new technological process, 
the new organization and new interpretation of historical or social 
facts, a creatively original painting or musical composition. It is un- 
doubtedly true that current tests of intelligence provide little oppor- 
tunity for the measurement of emergence of originals. The question, 
then, so far as these tests are concerned, is whether this attribute, 
creative originality, is really dependent upon a combination of abilities 
which are actually measured by available tests and whether the results 
obtained by means of the tests are indicative of the degree to which a 
person possesses this attribute. Some psychologists, Stoddard among 
them, believe that at present the tests of intelligence do not satisfactorily
discern and rate an individual’s intellectual originality. Others
maintain that the hierarchy of abilities established by means of the
tests enables us to identify the persons who possess originality in
greater or lesser degree.⁴

Stoddard’s last two conditions of intelligent behavior, “concentration
of energy” and “resistance to emotional forces,” are subject to
the same criticism as Wechsler’s inclusion of “drive” and “incentive.”
Motivation and ability to exert sustained effort are usually regarded
as nonintellectual aspects of activity and are certainly recognized as
playing highly important roles in one’s general effectiveness. But to
introduce them into a test of mental ability would be to confuse and
probably to invalidate efforts to arrive at a reasonably valid measure
of the level of intelligent activity at which a given person is able to
operate, regardless of whether he actually does operate at that level in
all situations.

4 The observation of psychologists is that people who show creative originality,
almost without exception, score high or very high on intelligence tests;
but many persons may score very high on these tests without having exceptional
powers of originality. We need tests of originality, but in view of the very nature
of the concept and its expressions, such tests cannot very well be standardized.

While tests of intelligence do not directly measure motivation and 
concentration of energy on the solution of problems, the psychological 
examiner does in fact try to develop or encourage conditions wherein 
the persons being examined will operate at their maximum levels of 
ability. This can be more nearly achieved in administering an indi- 
vidual test than in administering group tests. Furthermore, if an in- 
dividual is not adequately motivated, is not expending a maximum of 
energy during the test, or is handicapped by emotional factors during 
the testing, these conditions can be discerned much more readily dur- 
ing the examination of one person at a time than during the examina- 
tion of a group all at once. Indeed, during group testing there may be 
instances of persons whose test results are vitiated by the effects of the 
unfavorable conditions mentioned without that fact being known to 
the examiner. This possibility is a disadvantage of group testing. 

Of course, no single test or series (“battery”) of tests can provide 
an unfailing index or a guarantee of motivation, energy output, or 
freedom from emotional blocking in all future situations requiring 
intelligent behavior. For man is not a static being; nor does the en- 
vironment in which he lives remain static. Long-term motives and im- 
mediate incentives will change; values and interests will change. The 
affective (emotional) quality of a person’s experiences will influence 
his subsequent behavior, including situations requiring the utilization 
of intelligence. Tests of intelligence now in use are not intended to 
determine the extent to which an individual will in the future concentrate
his energy on problems demanding the use of his intelligence, nor
to determine whether it is probable that he will be able to remain free 
from emotional blockings. A variety of personality rating scales and 
inventories and projective techniques have been devised to evaluate 
these nonintellectual traits. Although tests of intelligence can and will 
no doubt be improved so that greater demands will be made upon 
concentration of attention and sustained effort than is the case with 
most tests at present, psychologists believe they are warranted in assuming
that a qualified examiner will be able to determine whether or
not a given person’s performance on an individual test represents his 
maximum level at that time. They believe, also, that most persons can 
be motivated to perform at their best levels when taking a group test.
In any testing of groups this must be reasonably well assured, as well
as assumed.⁵

Although, as stated, intelligence tests are not designed to measure a 
person’s emotional status and other nonintellectual aspects of person- 
ality, there is at present a trend, especially among clinical psycholo- 
gists, to analyze test performance for evidence of emotional states, 
personality “dynamics,” and for “differential diagnosis” (that is, for 
symptoms of neuroses, psychoses, or other atypical states). This trend, 
and the basic idea, while still in a rather early experimental stage, are 
not without merit and some validity. This aspect of test interpretation,
which demands sensitive clinical insights, will be discussed in a later
chapter under scatter analysis and under projective analyses of test
behavior.


IMPLICATIONS FOR TEST DESIGN AND CONTENT 


Definitions of intelligence are of more than theoretical impor- 
tance. The conception of intelligence which a psychologist holds will 
affect, to some extent at least, the content of the test he develops. Yet, 
at the same time, an examination of a representative group of tests 
reveals the fact that although some are different from others in certain 
aspects, they all, nevertheless, do have much in common. It would be 
incorrect to say, for instance, that certain tests exemplify exclusively 
the definition that intelligence is the capacity to learn. The fact, then, 
that psychologists emerge with tests having considerable similarity, 
though they might start with different definitions, must mean that their 
definitions differ largely in respect to emphases and that, as already 
pointed out, they are interdependent.

Early experimenters in mental testing attempted to measure general 
intellectual capacity by means of a single type of test which measured 
only a single capacity, usually a sensory process, or association, or 
attention. Thus they identified general intellectual capacity with a
single function. Their efforts were fruitless. Later, however, experiment
showed that a variety of test materials yielded more accurate and
more useful results when validated against accepted criteria of intelligent
activity. Psychologists, in seeking to encompass a greater variety
of items in their tests, and thus to produce more useful and successful
instruments, regardless of the exact definition with which each one
started out, found that inevitably their testing instruments were
broader than their definitions. Current tests thus have more than a
little in common in spite of differences in details of their content. Inspection
of their content will show that in varying degrees they are
measures of some aspects of learning in what is assumed to be a reasonably
uniform environment for all persons,⁶ that novel situations
and problems are presented, and that ability to carry on abstract
thinking is tested through the utilization of symbols and ideas. Inspection
will demonstrate also that most of the tests fail to meet the
more comprehensive and long-term attributes suggested in Stoddard’s
definition.

5 The “catch” lies, however, in the fact that in any large group of persons
being tested it is not unlikely that there will be a few, at least, who are not adequately
motivated or who are handicapped by emotional difficulties. Herein is
a source of error in group measurement. The discovery of these nonmotivated
or blocked individuals will depend upon whether or not each one’s rating on
the test is scrutinized and evaluated in the light of other and perhaps conflicting
evidence. Where there is reason to believe that group-test performance is spuriously
low in the case of a given individual, it is desirable to re-examine, using
an individual rather than a group test.

So far as available tests are concerned, it is important to bear in
mind that the French psychologist Binet (the “father” of modern mental
testing) took the position that it made little difference what specific
tasks and items were incorporated into a test provided that in some 
degree each part was a measure of the individual’s general capacity. 
Whether or not this condition is met will depend, of course, upon the 
definition of intelligence regarded as most adequate by the designer 
of a test and upon the criteria of intelligent activity against which test 
results are checked for validity. Developments have been such that 
in spite of some differences in definition and in spite of some differ- 
ences in external appearances, psychologists believe that their tests are 
reasonably sound because they are related to and have value in pre- 
dicting the likelihood of intelligent activity in life-situations. 

THREE “KINDS” OF INTELLIGENCE 

Some psychologists believe that several kinds of intelligence
should be distinguished from one another. Noteworthy among them is
E. L. Thorndike who has divided intelligent activity into three types:
namely, (1) social intelligence, or ability to understand and deal with
persons; (2) concrete intelligence, or ability to understand and deal
with things, as in skilled trades and in working with the appliances of
science; (3) abstract intelligence, or ability to understand and deal
with verbal and mathematical symbols.

6 This raises the much-debated problem of heredity and environment as factors
in the development of intelligence.

The merit of this classification of types of intelligent activity, for 
psychological testing and diagnosis, lies in the fact that it indicates 
several realms in which persons might be functioning and implies that 
separate and sufficiently specialized tests might be devised to measure 
how effectively persons are functioning in each. 

While it is true that in the case of any given person the scores at- 
tained on a test of ability to deal with verbal and numerical abstrac- 
tions might differ appreciably from those attained by him on a test of 
social relationships and insights, or on one of “concrete” intelligence, 
it is true, nevertheless, that when a representative group of individuals 
are tested, the correlations between the types of tests are found to be 
positive and significant, both statistically and psychologically. For example,
correlations between tests of verbal and of concrete abilities
vary from about .25 to about .45, the average being about .30-.35. 
While this is a somewhat low average correlation, it still indicates that 
some communality of function is being measured. This index, being 
so far from unity, also means that there are numerous individuals 
whose relative scores do not correspond closely or whose two relative 
scores may be discrepant. This fact points up the important psycho- 
logical principle that the data and status of any single person may be 
inconsistent with the general trend. Study of the individual, and the 
ways in which and the reasons why he deviates from or exemplifies 
general trends, is one concern of the clinical psychologist. 
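
The arithmetic behind this point can be made concrete. The short Python sketch below is an illustration only and is not taken from the original text. It assumes the standard statistical reading in which the square of a correlation coefficient gives the proportion of variance that two measures share; the function name shared_variance is hypothetical.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures have in common, given Pearson r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlations reported in the text for verbal versus concrete tests.
for r in (0.25, 0.35, 0.45):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.1%}")
```

On this reading, correlations of .25 to .45 correspond to roughly 6 to 20 percent of shared variance, which is why many individuals can show discrepant relative scores even though the group trend is positive.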

Of the three kinds of abilities enumerated above, abstract intelli- 
gence is the one that receives greatest weight and is most pronounced 
in current tests of intelligence—that is, whenever the test is designed 
for use with persons who are presumed to have reached a level where 
they reasonably may be expected to have developed facility in dealing 
with concepts and symbols. 

Even tests that present the subject with “things” rather than with 
ideas and symbols are not devoid of demands upon ability to con- 
ceptualize and make abstractions, although testees need not necessarily 
state these in the form of language and number. For example, when a
subject is required to arrange a series of pictures into a sequential and 
meaningful whole, he must at some stage form a concept of “the 
whole” if his response is to be correct by some other means than pure
chance. He must do this, also, in assembling parts into an integrated
unit (called “object assembly”). The same is true of the child who is 
asked which is the “prettiest” of two pictures (“aesthetic compari- 
son”), for he must have a concept of “prettiness,” however unarticu- 
lated it may be. There are in use many other types of test items that 
deal with things but still require more or less ability in concept formation.
Among these are, to name a few, object classification, tracing
the shorter of two routes in a maze, identifying objects by use,
supplying missing parts in the drawing of a human figure, etc. In short, 
the fact that some types of test items do not employ language or num- 
ber does not necessarily signify that they make no demands upon 
ability to reason at a level of concept formation and abstraction. 

It is true that at the earliest developmental levels there are tasks 
that depend upon visual-motor skill, such as tying a bow knot, grasp- 
ing a ring, holding a pencil and scribbling, manipulating cubes, and 
the like. These types of tests, however, are restricted principally to the
first eighteen months of life. They are useful as developmental
indicators, but they have only slight predictive value for later development
of mental abilities, as measured by tests at more advanced
levels.⁷

The role of ability to deal with ideas and symbols (words and
numbers) as a measure of concept formation and abstraction is of
increasing importance in tests of general ability (intelligence) as age
level increases. Proportions of verbal and numerical tests, on the one
hand, and nonverbal, nonnumerical, on the other, undergo change at
different age levels;⁸ some tests include a larger proportion of the
latter than do others, even at the adolescent and adult levels.⁹ These
differences are not haphazard, nor are they matters of individual
whim: they depend upon the purposes of the test and the test author’s
conception of intelligence and its constituent parts. It will be seen in
later chapters that the correlations between various tests of general
ability are quite marked, and at times high or very high, thus indicating
that to an appreciable degree in these tests the verbal and
numerical items on the one hand and the nonverbal, nonnumerical on
the other are measuring the same or closely related functions.¹⁰ High
intercorrelations do not always mean that the same functions are being
measured by the tests concerned; such correlations may reflect other
common factors which affect the tests being correlated. This, however,
is quite improbable as an explanation of test intercorrelations.

7 See Chapter 8, “Scales for Infants and Preschool Children,” in which this
problem is discussed at length.

8 Cf. the Revised Stanford-Binet Scale below the age-5 level. Also the Merrill-Palmer
Scale, the Minnesota Preschool Test, the Detroit Kindergarten Test,
etc.

9 Cf. the Dearborn Group tests, the Otis tests, the Kuhlmann-Anderson tests,
the Wechsler-Bellevue Intelligence Tests, the Pintner-Paterson scale.


ANALYSES OF MENTAL ABILITY 


The definitions of intelligence thus far discussed are functional 
in character; that is, they state how intelligence operates: through 
learning, adaptation, abstract thinking. But, in addition, psychologists 
have been concerned to know the fundamental nature and structure 
of intelligence. They have made analyses in an effort to determine its 
underlying factors. Or, otherwise stated, the purpose of these analyses 
has been to discover, if possible, the elements, or components, of in- 
telligence, not only for a better theoretical understanding of this com- 
plex process, but also to learn what might be the implications for the 
design and construction of mental tests.

It is not to be inferred, however, that the dynamics of intelligent 
activity can be adequately understood merely by enumerating and 
characterizing the components, whatever they might be. Whatever the 
components, they do not operate independently or in isolation. Understanding
the dynamic aspects of mental activity requires some means
of characterizing the organization of factors, their interrelationships,
and their relation to motivational forces.

Essentially, the experimental method followed is this: a rather large
number of separate tests, more or less diverse in character, are given
to an adequate sampling of the population. The results of each type
of test are correlated with those of all the others. The coefficients of
correlation are then subjected to various techniques of statistical
analysis in an effort to discover the extent of common ground between
them (technically known as communality), and their degree of independence.
These statistical methods are known as factor analysis.¹¹
The particular theory or structure of intelligence educed from the statistical
operations will depend upon the expert’s own interpretation of
the analysis; and the experts differ in their interpretations.¹² These differences,
however, need not invalidate the use of well-standardized
psychological tests; for, as will be seen, theoretical differences thus far
have not had far-reaching consequences as regards the kinds of intelligence
tests constructed.

10 The statements in this paragraph do not mean that the nonverbal materials
in the tests are measures of mechanical ability. Generally, they are believed by
the authors of the tests to measure the same psychological processes as do the
verbal materials, but by means of different content.

11 To be discussed later in this chapter.
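
The first step of the factor-analytic procedure described above, correlating each test with every other, can be sketched in a few lines. The Python example below is a minimal illustration with made-up scores for five hypothetical examinees on three hypothetical tests. It builds only the table of Pearson correlations that a factor analysis would start from, and it does not attempt the factor extraction itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 scores")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a test with no score variation has no correlation")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(scores):
    """Correlate every test with every other; `scores` maps name -> list."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Made-up scores, for illustration only (one list entry per examinee).
scores = {
    "vocabulary": [12, 15, 9, 20, 14],
    "arithmetic": [10, 14, 8, 18, 15],
    "mazes": [7, 9, 8, 11, 6],
}
matrix = correlation_matrix(scores)
for a in scores:
    print(a.ljust(12), "  ".join(f"{matrix[a][b]:.2f}" for b in scores))
```

The resulting table is symmetric with ones on the diagonal. The techniques the text calls factor analysis operate on a table of this kind.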


The Multi-factor Theory. Thorndike’s multi-factor theory of intelli- 
gence is at one extreme of the interpretations regarding the nature of 
mental organization. As the name of the theory indicates, intelligence 
is said to be constituted of a multitude of separate factors, or elements, 
each one being a minute element of ability. Any mental act, according
to this theory, involves a number of these minute elements operating 
together. Any other mental act also involves a number of the elements 
in combination. Hence, if performances on these two mental tasks are 
positively correlated, the degree of correlation is due to the number 
of common elements involved in the two acts. If two types of mental 
activities, A and B, are more highly correlated than are A and C, the 
reason, according to the multi-factor theory, would be that the first 
pair has more elements in common than does the second pair. Accord- 
ing to this theory, then, there is really no such factor as “general in- 
telligence”; there are only many highly specific acts, the number of 
such depending upon how refined a classification we might wish to 
make and are capable of making. 
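
One way to see how shared elements would produce correlation is a small sampling model. The Python sketch below is an illustration under assumptions that the text does not state: every element is treated as an independent quantity with the same variance, and a task score is simply the sum of the elements the task draws on. Under those assumptions the expected correlation between two tasks is the number of common elements divided by the geometric mean of the two element counts. The function name expected_correlation and the element labels are hypothetical.

```python
from math import sqrt


def expected_correlation(task_a: set, task_b: set) -> float:
    """Expected correlation of two tasks, each scored as a sum of
    independent, equal-variance elements (an assumed toy model)."""
    if not task_a or not task_b:
        raise ValueError("each task must draw on at least one element")
    common = len(task_a & task_b)
    return common / sqrt(len(task_a) * len(task_b))


# Hypothetical element labels; the integers carry no meaning of their own.
task_a = set(range(0, 10))   # elements 0..9
task_b = set(range(4, 14))   # shares 6 elements with A
task_c = set(range(8, 18))   # shares 2 elements with A

print(expected_correlation(task_a, task_b))  # A and B: more common elements
print(expected_correlation(task_a, task_c))  # A and C: fewer common elements
```

In this toy model A and B correlate more highly than A and C solely because the first pair has more elements in common, which is the reasoning the multi-factor theory offers.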

Thorndike’s is really an “atomistic” theory of mental ability. He 
adds, however, that certain mental activities have so many of their 
elements in common that it is useful to classify these tasks into sepa- 
rate groups to which special names are given, for example, verbal 
meaning, arithmetical reasoning, comprehension, visual perception of 
relationships, and others. Consequently, in constructing a mental test,
it appears even to Thorndike himself that his “atomistic” theory and
the multitude of minute elements of ability are of less practical sig- 
nificance than the conception that many of them operate together in 
any situation demanding intelligence. This is illustrated by Thorn- 
dike’s test designed to measure ability to deal with abstractions. His 
test is composed of four parts: sentence completion (C), arithmetical 
reasoning (A), vocabulary (V), and following directions (D). This 
instrument is known as the CAVD test. It is not claimed by Thorndike
that these four sets of items encompass the entire range of abstract
intelligence. They represent and sample only certain parts; but
because of the very significant correlations between all types of measures
within the tested range, it is held, the other aspects of abstract
intelligence can be estimated with satisfactory accuracy from those
portions that are actually measured by this test.

12 This, of course, is true of all sciences.


The Two-factor Theory. Opposed to Thorndike’s theory of the na- 
ture of intelligence is Spearman’s two-factor theory, which stands at 
the other extreme of interpretations. According to Spearman, all
intellectual activity is dependent primarily upon and is an expression of
a general factor common to all mental activity. This factor, designated
by the symbol g, is possessed by all individuals, but in varying degrees,
of course, since people differ in mental ability; and it (g) operates in
all mental activity, though in varying amounts, since mental tasks differ
in respect to their demands upon general intelligence. Spearman
characterized this general factor as mental energy, because in the
realm of intelligent activity, he maintained, it has a role similar to that
of physical energy in the physical world. Like all other scientific
concepts, the general factor can be observed and known only through
its specific manifestations, in this instance through psychological
tests. After analyzing tests with varying amounts of the general factor,
from high to low, Spearman concluded that the principal distinguishing
characteristic of tests highly “loaded” with g is that they require
insight into relationships, or what he called “the eduction of relations
and correlates.” For example, in solving an arithmetical problem, the
subject has to grasp the relationships between the data presented,
organize them with reference to the propositions given in the problem,
and deduce a correct answer. The g-content in this task is high. By
contrast, if the subject merely has to repeat a table of multiplications
or add a few numbers (both of which can be learned by rote), no
insights are necessary and no relationships need be grasped. In this
task, the amount of g involved is very small.13

Spearman postulated the g factor, in the first place, to explain
correlations that he found to exist among diverse sorts of perceiving,


13 When using an individual test, like the Stanford-Binet or the Bellevue, it
has often been observed by examiners that a subject who is unable to solve an
arithmetical problem (unable even to make a start toward a solution) may
still be able to perform the separate arithmetical processes involved. In
Spearman’s terms, such an individual is unable to educe the necessary
relations and correlates, for lack of the necessary amount of g.


Analyses of Mental Ability 75 


knowing, reasoning, and thinking, as illustrated in Table 9. That is 
to say, he concluded that all mental activity is to some extent depend- 
ent upon and an expression of this general factor; and the magnitude 
of the correlation coefficient found between any two forms of mental 
activity reveals the extent to which this g factor is operative in each 
and common to both. Thus, the amount of the general factor operating 
in each activity will determine the size of the correlation between the 
two mental activities being measured. The types of materials used in 
current tests of intelligence—word meaning, arithmetical reasoning, 
sentence completion, reasoning by analogy, paragraph interpretation, 


TABLE 9

Intercorrelations of Subtests 14

Subtests: (1) Analogies, (2) Completion, (3) Understanding paragraphs,
(4) Opposites, (5) Instructions, (6) Resemblances, (7) Inferences.
[The coefficients of this table are illegible in this copy.]


perception of relationships in geometric forms, picture completion, 
and others—all show significant degrees of positive correlation with 
one another. Spearman and his supporters at first ascribed this fact
to the presence of g, in greater or lesser amount, in all of them. Later
researches led them to conclude that certain “group factors” are also
present in some mental activities. These are the factors that occur in
more than one type of test item, but in less than all of any given set of
tests. The general factor, however, still remains the primary and
pervasive one.
Since the intercorrelations are by no means perfect, Spearman 
postulated the existence of specific factors, called s factors, each of 


which is specific to a particular type of activity. Thus, the two-factor 
theory states that all mental activities have in common some of the 
general factor; each mental activity might also be a member of a 
“group”; and each has also its own specific factor. Of the kinds of 
factors, the general one is regarded as the essential measure of intelli- 


14 From C. Spearman, The Abilities of Man, New York: Macmillan, 1927,
p. 149. (By permission.)




gence; accordingly a sound test of intelligence is one that will sample 
adequately the g factor in a variety of activities, and the best test ma- 
terials are those which call for the largest amount of the general factor. 
And the largest amounts of the general factor are believed to be de- 
manded by those types of test materials that have the higher inter- 
correlations with one another. 

As a matter of fact, since the beginning of modern mental testing,
psychologists have proceeded, at least implicitly, on the assumption
that all forms of mental activity have something in common: that
they are similar in certain basic respects. Otherwise, psychologists
could not have justified their practice of testing together in a single
instrument such diverse mental activities as defining words, solving
arithmetical problems, finding similarities and differences, repeating
digits forward and backward, completing sentences in a meaningful
manner, perceiving geometric forms, etc. All of these, and the others
used, must have been regarded as being measures, to a greater or
lesser degree, of general intelligence. From the total performance on
these tests, it was believed that an individual's level of general
intelligence would emerge. Therefore, psychologists believed they were
justified in adding up the test items correctly passed in the several
types of activity and deriving a single total score to represent an
individual’s general intelligence level.15 This is the actual practice being
followed in nearly all tests, including individual as well as group scales
of mental ability.

The practical implication of the Spearman two-factor theory is
clear, so far as test construction is concerned. A test conforming to
this theory would be one whose materials and several parts are
saturated with the general factor so that measurement thereby would cause

15 While this practice is not being discontinued, and should not be, increased
emphasis is now being placed on the desirability of representing each
individual by means of a test profile, where possible, as well as by a general index.
There are some psychologists, however, who would abandon the use of all
indexes of general level and would substitute a profile representing the
individual’s relative rank in each of the specific types of test materials being used; e.g.,
numerical ability, word meaning, spatial perception, and the like.

16 In addition to g and s, Spearman and others have found by further analysis
of experimental results that there are some nonintellectual factors (such as
volition, interest, persistence) which influence a person’s effectiveness.
Spearman and adherents of his theory have also discerned a few groups of factors
that are intermediate between g and the highly specific s. They suggest that
musical aptitude and mechanical aptitude are of this type.




the testee’s level and quality of g to emerge, while the effects of specific 
factors (s) would be canceled out. Thus, the net result of the test 
would be a measure of g. To achieve this would require a skillful 
selection and development of test problems and parts that are sig- 
nificantly intercorrelated, which at the same time satisfy the practical 
criteria of intelligent activity. Such a test, presumably, would yield an 
index which reflects the caliber of a particular mentality working as 


a whole. 


The Group-factor Theory. Intermediate between the theories of
Thorndike and Spearman are the group-factor theories, from among
which we select for presentation that of Thurstone because it has been
most highly developed, has received most consideration, and has
resulted in the construction of a set of measures called tests of primary
mental abilities.

According to the group-factor theory, intelligent activity is not
an expression of innumerable highly specific factors, as Thorndike
claimed. Nor is it the expression primarily of a general factor which 
pervades all mental activity and is the essence of intelligence, as 
Spearman held. Instead, the analyses and interpretations of Thurstone 
and others led them to the conclusion that certain mental operations 
have in common a “primary” factor which gives them psychological 
and functional unity and which differentiates them from other mental 
operations. These mental operations, then, constitute a “group.” A 
second group of mental operations has its own unifying “primary” 
factor; a third group has a third; and so on. In other words, there are 
a number of groups of mental abilities (the number being as yet un- 
determined) each of which has its own “primary” factor, giving the 
group a functional unity and cohesiveness. Each of these “primary” 
factors is said to be relatively independent of the others. 

After administering a large variety of types of test materials to col- 
lege students and to high-school and eighth-grade pupils, and after 
making correlational analyses of the results, Thurstone and his col- 
laborators concluded that six “primary” factors emerged clearly 
enough for identification and use in test design and construction. They
are, briefly, the following.17


17 From L. L. Thurstone and T. G. Thurstone, The Chicago Tests of Primary
Mental Abilities, Manual of Instructions, Chicago: Science Research Associates,
1943, p. 7. See also L. L. Thurstone, Primary Mental Abilities, Psychometric
Monograph No. 1, Chicago: University of Chicago Press, 1938; L. L. Thurstone




The Number factor (N): “ability to do numerical calculations
rapidly and accurately.”

The Verbal factor (V): “found in tests involving verbal comprehension.”

The Space factor (S): “involved in any tasks in which the subject
manipulates an object imaginally in space.”

The Word Fluency factor (W): “involved whenever the subject is
asked to think of isolated words at a rapid rate.”

The Reasoning factor (R): “found in tasks that require the subject
to discover a rule or principle involved in series or groups of letters.”
Although it is believed both induction and deduction are involved, it
seems that induction is the more significant here.

The Rote Memory factor (M): involving “the ability to memorize
quickly.”


In spite of the fact that “primary” mental abilities (or factors)
were originally said to be functionally independent of each other, it
was actually found that they are positively and significantly intercorrelated,

TABLE 10

Intercorrelations of Subtests 18

        N      W      V      S      M      R
N       ..
W      .41     ..
V      .40     ?      ..
S      .28    .17    .16     ..
M      .31    .36    .35    .13     ..
R       ?     .49    .59     ?     .39     ..

(N, number facility; W, word fluency; V, verbal meaning;
S, spatial perception; M, rote memory; R, reasoning.
A question mark stands for a coefficient illegible in this copy.)

as shown in Table 10. This must mean that the “primary” and
presumably independent factors are not the only factors at work in 
the mental activities required by the tests. There must be some other


and T. G. Thurstone, Factorial Studies of Intelligence, Psychometric Monograph
No. 2, Chicago: University of Chicago Press, 1941. Some modifications


of factors have been introduced in the most recent issues of these tests for 
younger subjects. The six named above a 


The “primary” mental abilities do not i 
abilities. They do not include mechanical, musical, or 
“primary” abilities involve largely, though not entirely, t 
in “abstract intelligence” and in the academic types of learning. 


18 


1943. (By permission. ) 




factor, or factors, to account for the common ground (as shown by 
the positive correlations) that exists between the various psychological 
tests intended to measure these “primary” factors. In other words, it 
seems that the test authors have not been able to devise test materials 
which will sample the “primary” mental abilities in “pure” form. The 
Thurstones, therefore, concluded that in addition to the “primary” 
abilities there is a “second-order general factor.” They also stated, in 
their earlier test manual, that “If further studies of the primary mental
abilities should reveal this general factor, it may sustain Spearman’s 


intellective factor.” 

Subsequent studies of the “primary mental abilities” do tend to re- 
veal a general factor. The more recent intercorrelations found among 
the several tests that make up the PMA 20 batteries are quite marked,
especially at the lower age levels (when abilities are less differentiated 
through education and interest than they will be in later years). 

For the tests at the 5 to 7 year level, the intercorrelations range 
from .46 to .67 (average equals .55). For tests at the 7 to 11 year 
level, the range of coefficients is from .41 to .70 (average equals .50);
while for ages 11 to 17, the range of coefficients is from .13 to .59
(average equals .30+). It appears, thus, that the group-factor adherents
have found it necessary to posit the operations of a general
factor; but at present they regard it as being of a “second order.” 

In evaluating the group-factor hypothesis we need not question the 
soundness of the statistical methods used or the comprehensiveness of 
the experimentation.21 Several observations are necessary, however, to
enable the reader to make a fuller assessment of the hypothesis. In
the first place, intelligence is not an entity which operates in a vacuum;
it is not something “given,” even in the sense that some physical traits
are “given,” such as the color of eyes and hair, the number of digits,
etc. Intelligence is, rather, a name for certain kinds of activity; we can
know of it only through its manifestations in behavior. Intelligent
behavior develops and is manifested in one kind of environment or
another; hence, the particular form of expression that intelligent activity
takes will depend upon the sort of functions which are developed and 
fostered in a given cultural environment. In our own and similar cul- 


19 Ibid., p. 7.

20 Primary Mental Abilities.
21 Some have criticized adversely Thurstone’s methods and his
interpretations of his findings.




tures, verbal and numerical abilities are, of course, essential; they are 
fostered from earliest childhood, and they receive greatest attention 
and emphasis in our schools. There is thus a relationship between this 
cultural emphasis and the fact that three of Thurstone’s six “primary”
factors are concerned with words and numbers. It is probable also that
the “Space” factor emerges from statistical analyses because of our
experiences with things in three-dimensional space. The two remaining
factors, “Reasoning” and “Rote Memory,” are characteristic, in
greater or lesser degree, of all persons regardless of the particular
culture; we should therefore expect to find them as factors of
intelligent activity in any analysis. Furthermore, the “Reasoning” factor is
very likely much the same as Spearman’s g, although the latter has
been presented as having several aspects.22 In effect, then, the point
is that some of the particular factors through which intelligence is
expressed are developed by experience and education; and these
particular factors, such as the six “primaries,” may well be conceived
of as particular manifestations of a general ability rather than as
“primary” abilities.

The proponents of the group-factor hypothesis do not claim that 
the exact number of “primary” mental abilities is known. Hence, we 
must caution the reader against assuming that there is a finality about 
the present number. For example, a factor of “Speed” will not appear 
in a statistical analysis of test results unless speed of performance is, 
first, a variant in the population being measured and, second, a re- 
quirement within the tests themselves. The same can be said of “per- 
sistence” or “mental fatigue.” Similarly, “originality” would appear as 
another factor, if it could be measured. 

These observations do not invalidate the tests that have been or 
might be designed and constructed on the basis of the group-factor 
hypothesis. As a matter of fact, the contents of such tests thus far 
published, though differently organized, are not radically different in 
their essentials from those that have been designed and constructed 
on the basis of either of the two other theories of the nature of intelli- 
gence. 

For our present purposes, two consequences of group-factor analyses
are indicated. First, the conceptual framework has resulted in more
clearly specified and defined test categories and types of test items


22 Apprehension of one’s own experience, the eduction of relations, and the
eduction of correlates. 




than was the case previously. Second, several batteries of tests have 
been constructed on the basis of group-factor theory, particularly the 
Thurstones’ tests of Primary Mental Abilities. 

The early versions of the PMA tests did not yield an over-all index 
of performance, such as mental age, intelligence quotient, or an over- 
all percentile rank. Instead, they gave for each subject separate per- 
centile ranks to represent his performance level in each of the “pri- 
mary” factors. These ranks were then used to make, for each person, 
a “mental profile” for purposes of educational and vocational guid- 
ance. While they granted that the single index (mental age and IQ), 
based upon a variety of mental activities, is useful for many practical 
purposes, group-factorists originally maintained that their method of 
finding separate ratings for each of the “primary” factors enables the 
examiner more readily and adequately to recognize a testee’s marked 
mental abilities and disabilities, the degree of uniformity or lack of 
uniformity.23

There is merit in this contention; yet, at the same time, there is no 
good reason why the group-factor type of test should not also yield 
an over-all rating (such as MA or IQ) as well as indexes of relative 
rank for each of the specified factors. While mental age and intelli- 
gence quotient should never be interpreted and used mechanically and 
uncritically, or in disregard of the specific performances that have
contributed to them, they do nevertheless have considerable significance
and valuable connotations for the qualified examiner and interpreter.24

The group factorists have apparently recognized this point in their 
more recent interpretations of test findings. Equally important to them 
—if not more so—is the fact that their correlations and factorial 
analyses have persistently yielded results that could not be explained 
in terms of group factors alone, and that the g factor was indicated. As 
a result, the most recent editions of the PMA tests provide IQ equiva- 


23 This can be done by a competent examiner when another group test or the
Binet or the Bellevue is used. In the case of the last two the “scatter” of test
performance is studied. In the case of group tests, the individual’s performance
on each of the several parts can be compared with his performances on the
other parts. The disadvantage here lies in the fact that the usual group test does
not provide separate scores and norms for each of the parts.

24 A weakness of the group-factor type of test is this: the breakdown into
separate factors ignores the fact that intelligence expresses itself in behavior as a
combination, a unity of functions, not as a series of independent factors.




lents for the scores on the scale for ages 11 to 17, while for the
younger age levels both MA units and quotients are provided. 

As is so often the case in scientific problems—especially in the rela- 
tively new ones—divergent theories in time tend to come into closer 
agreement. The Spearman Two-Factor Theory now recognizes that 
some group factors should be posited to explain test findings; but em- 
phasis is upon the g factor. Perhaps the Spearman theory may now 
be renamed “The General Factor-Group Factor Theory,” while the 
other might be renamed “The Group Factor-General Factor Theory.” 
The narrowing of differences between the two theories represents sig- 
nificant scientific progress. 


FACTOR ANALYSIS 


The two-factor and the group-factor theories are the two most
prominent examples of doctrines emerging from the methods of factor
analysis. Although this subject is highly technical, it is desirable to
explain it somewhat more fully at this stage.

The technique is essentially a search for the psychological functions
which are at the basis of and determine test performance. All
techniques of factor analysis are statistical and based upon the correlation
coefficient. After the statistical calculations have been made, it is
necessary for the investigator to bring to bear his psychological insights
to interpret and name his statistical findings. Tests contain a variety
of items. What psychological functions do the various types of items
have in common? Are there functions in common between various
tests of verbal performance? Between verbal and numerical? Between
spatial perception and numerical ability? Between reasoning with
verbal and with nonverbal materials? These are among the questions
the factor analyst seeks to answer. After he has found his answer, at
least tentatively, he proceeds to construct a scale in which items are
included and so grouped as to measure only, or almost solely, the
factors he has segregated from his preliminary testing and statistical
analysis.

The factor analyst does not begin with a definite set of preconceived 
mental functions. He tries to discover which psychological functions, 
or components, are necessary to explain his data. Yet, it should be 
noted, he must at the very outset have some conception of the kinds of 
test items to include in preliminary experimentation. Thus what he
ultimately distills out as factors is basically dependent upon his origi- 


Factor Analysis 83 


nal conceptions regarding his preliminary items. The factor analyst, in 
seeking the components of “intelligence,” for example, does not start
with tests of color perception, tone discrimination, or finger dexterity. 


Two-factor Theory. We have already stated Spearman’s two-factor 
theory. It will be helpful now to describe in more detail the reasoning 
whereby the theory was arrived at. Spearman, in his early experimen- 
tation, was impressed by the fact that all the intercorrelations were 
positive in a table of coefficients for various types of items. He was 
also impressed by what appeared to be a hierarchy of coefficients in 
the rows and columns of the table; not perfect gradations, but strong 
evidence of proportional gradations. He therefore offered a hypothetically
perfect table of correlation coefficients to illustrate his point
(Table 11). 
TABLE 11

Spearman’s Hypothetical Table of Correlations 25

                         1      2      3      4      5
(1) Opposites            ..    .80    .60    .30    .30
(2) Completion          .80     ..    .48    .24    .24
(3) Memory              .60    .48     ..    .18    .18
(4) Discrimination      .30    .24    .18     ..    .09
(5) Cancellation        .30    .24    .18    .09     ..


Not only are the coefficients positive and in a decreasing order along 
rows and columns, but theoretically any two columns (or rows, since 
the table is symmetrical about the diagonal which contains the self- 
correlations) are in direct proportion. The criterion of proportionality 
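requires that the following correlational relationships should hold:

\[ \frac{r_{13}}{r_{23}} = \frac{r_{14}}{r_{24}} = \frac{r_{15}}{r_{25}} \]

Taking only the first two ratios,

\[ \frac{r_{13}}{r_{23}} = \frac{r_{14}}{r_{24}} \]

and multiplying by the denominators, we have

\[ r_{13} r_{24} = r_{14} r_{23} \]


25 From C. Spearman, The Abilities of Man, New York: Macmillan, 1927,
p. 74. (By permission.)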




Transposing, we get

\[ r_{13} r_{24} - r_{14} r_{23} = 0 \]

From this tetrad equation (so called because the test correlations are
dealt with in sets of four) may be obtained what is known as the
tetrad difference.

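The tetrad equation may be written for the combination of any
four tests. By rearranging the four coefficients, three tetrad
differences may be obtained for every combination of four tests. Thus,
using t as the notation for tetrad difference:

\[ t_{1234} = r_{12} r_{34} - r_{13} r_{24}, \qquad t_{1243} = r_{12} r_{34} - r_{14} r_{23}, \qquad t_{1342} = r_{13} r_{24} - r_{14} r_{23} \]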


Theoretically, the tetrad difference criterion is satisfied if t is zero. 
When it is zero, Spearman and others have offered mathematical evi- 
dence to demonstrate that a single common factor can account for 
the relationships among the four tests, or variables. But in actual 
fact, the difference is rarely if ever zero. However, if the differences 
are close to zero, we may also conclude the criterion is satisfied, since 
correlations between tests will be affected by errors of measurement 
due to chance and accidental factors.26 (Cf. discussion of reliability
in Chapter 1.) Furthermore, the correlation coefficients would also 
be affected by the operations of the specific factor in each test. The 
specific factor was postulated to explain, in part at least, the tetrad 
differences that were greater than zero. 
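
The tetrad criterion lends itself to a short computational check. The sketch below is an illustration only and not Spearman's own procedure. It assumes a hypothetical, perfectly proportional table of five tests in the manner of Table 11; the names R, tetrad_differences, and satisfies_tetrad_criterion are introduced here for the example, and the fixed tolerance merely stands in for the comparison of each difference with its probable error.

```python
from itertools import combinations

# Assumed, hypothetical coefficients in the manner of Table 11.
# Order: Opposites, Completion, Memory, Discrimination, Cancellation.
R = [
    [1.00, 0.80, 0.60, 0.30, 0.30],
    [0.80, 1.00, 0.48, 0.24, 0.24],
    [0.60, 0.48, 1.00, 0.18, 0.18],
    [0.30, 0.24, 0.18, 1.00, 0.09],
    [0.30, 0.24, 0.18, 0.09, 1.00],
]


def tetrad_differences(r, a, b, c, d):
    """Return the three tetrad differences for tests a, b, c, d (0-based)."""
    return (
        r[a][b] * r[c][d] - r[a][c] * r[b][d],
        r[a][b] * r[c][d] - r[a][d] * r[b][c],
        r[a][c] * r[b][d] - r[a][d] * r[b][c],
    )


def satisfies_tetrad_criterion(r, tolerance=1e-9):
    """True if every tetrad difference in the table is zero within tolerance.

    The tolerance is an assumption of this sketch; with real data each
    difference would be judged against its probable error instead.
    """
    return all(
        abs(t) <= tolerance
        for quad in combinations(range(len(r)), 4)
        for t in tetrad_differences(r, *quad)
    )


if __name__ == "__main__":
    print("Criterion satisfied:", satisfies_tetrad_criterion(R))
```

Raising any single coefficient in the assumed table, for example the one between Discrimination and Cancellation, leaves some tetrad differences well away from zero, which is the situation that led to the postulation of further factors.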

If more than four tests are being examined to disclose the func- 
tions involved, we may substitute, say, tests numbers 5 and 6 in the 
tetrad equations in place of numbers 3 and 4. Thus we would be 
analyzing tests 1, 2, 5, 6. Then if the tetrad difference criterion is satis- 
fied, it would be concluded that the functions common to 1 and 2 are 


identical with those common to 5 and 6. Assuming that the tetrad dif- 


ference criterion was satisfied also for tests 1, 2, 3, 4, then the same 
functions (or factor) are said to be common to the six variables. The 


same reasoning may be applied to any number of tests. 
The principal conclusion drawn by Spearman and some others, 
after their analyses, was that the degree of correlation between any 


26 Formulas are provided for calculating probable errors of tetrad differences;


comparison of a tetrad difference with its probable error (PE) enables one to 
decide whether it differs significantly from zero. 




two tests is dependent upon the extent to which g is involved in each. 
Subsequent investigations showed, however, that some test intercor- 
relations may include their own common factors beyond the single 
common g. It was necessary, therefore, to postulate the operations 
of group factors, each group being effective in two or more tests, 
but not in all of them. Spearman and others recognized such group 
factors as numerical, verbal, speed, mechanical, imagination, and at- 
tention. In addition, Spearman had postulated three nonintellective 
factors which influence one’s mental effectiveness: perseveration (p);
oscillation (o), being one’s variability in performance in continuous
mental activity; and will (w), being one’s persistence in effort.

The tetrad difference criterion does not in itself show the relative 
weight or importance of the common factor in each kind of test.
Following the work of Spearman, therefore, methods have been
developed for finding the weights (commonly called “loadings”) of each
factor, general or group, in each of the intercorrelated tests. These
methods are known as “factor pattern analysis.”27
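
What a “loading” means can be shown for the simplest case. The sketch below is not one of the factor pattern methods just cited. It assumes that a single common factor accounts for the whole table, so that each coefficient is the product of two loadings, and it recovers each loading from triads of coefficients. The table R holds assumed, hypothetical values, and the function name single_factor_loadings is introduced here for illustration.

```python
from itertools import combinations
from math import sqrt

# Assumed, hypothetical coefficients (a perfectly proportional table).
R = [
    [1.00, 0.80, 0.60, 0.30, 0.30],
    [0.80, 1.00, 0.48, 0.24, 0.24],
    [0.60, 0.48, 1.00, 0.18, 0.18],
    [0.30, 0.24, 0.18, 1.00, 0.09],
    [0.30, 0.24, 0.18, 0.09, 1.00],
]


def single_factor_loadings(r):
    """Estimate the loading of each test on one common factor.

    If r[i][j] = a_i * a_j for all pairs, then a_i squared equals
    r[i][j] * r[i][k] / r[j][k] for any two other tests j and k.
    The estimates from all such pairs are averaged for each test.
    """
    n = len(r)
    loadings = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        estimates = [
            r[i][j] * r[i][k] / r[j][k] for j, k in combinations(others, 2)
        ]
        loadings.append(sqrt(sum(estimates) / len(estimates)))
    return loadings


if __name__ == "__main__":
    for index, loading in enumerate(single_factor_loadings(R), start=1):
        print(f"Test {index}: loading {loading:.2f}")
```

With real data the separate estimates for a test would disagree somewhat, and where group factors are present this one-factor reckoning no longer applies.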

The two-factor theory can account for the universal positive corre- 
lation coefficients among the various kinds of test items included in 
scales to measure mental ability, since every form of test requires 
the operation of g to some degree. Pooling a variety of kinds of tests 
in a scale is sound practice, according to this theory, because we 
thereby approximate a measure of pure g. Since the s factors are
uncorrelated within any individual (that is, they may be possessed in
varying and random degrees by him), they will be a negligible factor
in the total performance in a pooled test of general ability, because the 


varied s factors will tend to cancel out one another. 
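
The effect of pooling can be worked out under a simple model. The sketch below assumes, for illustration only, that every test score is standardized, that it is made up of g and one specific factor, that all these factors are uncorrelated, and that the scale is an unweighted sum of the tests. Under those assumptions the correlation of the total with g follows directly from the loadings. The loading of .6 and the function name composite_g_correlation are hypothetical.

```python
from math import sqrt


def composite_g_correlation(loadings):
    """Correlation between an unweighted sum of standardized tests and g.

    Each test is assumed to be a_i * g plus an uncorrelated specific part
    of variance 1 - a_i**2, so the sum has covariance sum(a_i) with g and
    variance sum(a_i)**2 + sum(1 - a_i**2).
    """
    if not loadings:
        raise ValueError("at least one test is required")
    common = sum(loadings)
    specific_variance = sum(1.0 - a * a for a in loadings)
    return common / sqrt(common * common + specific_variance)


if __name__ == "__main__":
    assumed_loading = 0.6  # hypothetical value, the same for every test
    for n_tests in (1, 4, 16):
        r = composite_g_correlation([assumed_loading] * n_tests)
        print(f"{n_tests:2d} tests pooled: correlation with g = {r:.3f}")
```

As more tests are pooled, the specific variance grows only in proportion to their number while the common part grows with its square, so the total comes to depend more and more on g.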


Sampling Theory. The two-factor theory has been criticized by some
statistical psychologists, notably G. H. Thomson and L. L. Thurstone.
Thomson offers a sampling theory 28 to explain the same tables of
intercorrelations. Briefly, his view is that the coefficients of correlation
are the results of common samplings and combinations of independent
factors. The number of common independent factors utilized by two


27 T. L. Kelley, Essential Traits of Mental Life, Cambridge:
Harvard University Press, 1935; L. L. Thurstone, Vectors of the Mind:
Multiple-Factor Analysis for the Isolation of Primary Traits, Chicago: University
of Chicago Press, 1935.

28 G. H. Thomson, The Factorial Analysis of Human Ability, Boston:
Houghton Mifflin, 1939, Chapter 3.



tests will determine the coefficient of correlation between these two. 
This theory is, of course, the same as Thorndike’s, except that Thom- 
son concedes the practical usefulness of a concept like g. Thomson 
also adds that if several tests call upon many elementary factors in 
common, they will not only have a very marked or high coefficient 
of correlation, but they will give the appearance of having one com- 
mon comprehensive factor. Also, Thomson’s theory maintains that 
if several tests draw upon a relatively smaller number of the ele- 
mentary factors in common, we have then group factors—that is, a 
limited number of factors that enter into performance on types of 
tests which are distinguished by the fact that they have certain mental 
processes in common but do not share a very large number of ele- 
mentary factors or a universal g. 
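
The sampling view can be put in numerical form under a deliberately simple assumption, made here for illustration and not taken from Thomson: each test score is the sum of the elementary factors it draws upon, and these factors are independent and of equal variance. The correlation between two tests then depends only on how many elements they share. The function name sampling_correlation and the element counts are hypothetical.

```python
from math import sqrt


def sampling_correlation(elements_a, elements_b):
    """Correlation between two tests that are sums of elementary factors.

    With independent elements of equal variance, the covariance equals
    the number of shared elements and each variance equals the number
    of elements the test draws upon.
    """
    a, b = set(elements_a), set(elements_b)
    if not a or not b:
        raise ValueError("each test must draw on at least one element")
    return len(a & b) / sqrt(len(a) * len(b))


if __name__ == "__main__":
    # Hypothetical tests, each described by the elements it samples.
    test_x = range(0, 80)
    test_y = range(20, 100)   # shares 60 elements with test_x
    test_z = range(70, 110)   # shares 10 elements with test_x
    print(f"x with y: {sampling_correlation(test_x, test_y):.2f}")
    print(f"x with z: {sampling_correlation(test_x, test_z):.2f}")
```

Tests sharing many elements correlate highly and so give the appearance of one comprehensive factor, while tests sharing few correlate weakly, without any general factor having been written into the model.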

While both theories require that a scale to measure general mental 
ability should pool a variety of types of tests differing in content and 
mental processes employed, the two-factor theory would seek sub- 
tests (parts of the scale) that have high intercorrelations, whereas 
the sampling theory would seek subtests having low intercorrelations 
among themselves but high correlations with the criteria of validity. 


Group-factor Theory. As already stated, Thurstone and others be- 
lieve that a group-factor theory fits the facts best and is most useful 
in testing practice. Their view differs from Thomson’s in that they
reject the theory of a very large number of independent factors. As 
previously explained, a group-factor is conceived of as an operational 
concept to account for correlations of performance within only a 
limited group of tests.29 Several different groups of factors are necessary
to account for all mental activity, plus the more recently added
second-order g factor, which Thurstone states may be more “central” 
in character and more “universal” in influence. 


Thurstone has contributed very significantly to the methodology of
group-factor analysis, particularly his geometric methods; and from
his analyses, as we shall see, has emerged a scale to test mental
ability. This volume is not the place to present his techniques; we shall
merely state his purposes.30


29 Most recently, some group-factor theorists have characterized “primary
factors” as facilities of the mind and as media of expression.

30 A number of others have made significant contributions to factor theory,
especially K. J. Holzinger, H. Hotelling, R. C. Tryon, H. E. Garrett, C. L. Burt,
J. C. Flanagan, J. P. Guilford, and P. Vernon.




Three objectives, according to Thurstone, are to be achieved by 
factor analysis: (1) determination of the smallest number of primary 
mental abilities to be postulated as an explanation of tables of 
intercorrelations; (2) determination of the amount of each primary 
ability that is involved in each test; and (3) determination of 
regression equations whereby the amount of a primary mental ability 
in an individual can be estimated from tests that draw upon that 
ability. As an illustration, we may consider several tests in which 
only two group factors are involved. An individual might make a high 
score in these tests either by having a moderately high level of 
ability in each of Factor I and Factor II, or by having very much of 
one and little of the other. Also, if Factor I carries much heavier 
weight in the tests than does II, then a high level of ability on I is 
more important for high performance on these tests than is a high 
level of II. Thus, the Thurstone method would find the relatively few 
primary or basic mental abilities, devise a scale to measure all of 
them, and so organize and score the subtests as to reveal each 
individual's relative strength in each factor. 


TABLE 12 


The Two-Factor Pattern 

Test    General Factor    Specific Factors 
1       x                 S1 
2       x                 S2 
3       x                 S3 
4       x                 S4 
5       x                 S5 
6       x                 S6 

The Group-Factor Pattern 

Test    Group Factors (A, B, C, D) 
1       x  x  x 
2       x  x  x 
3       x  x 
4       x  x  x 
5       x  x  x 
6       x  x 

Factor Theories Combined 

Test    General Factor    Group Factors    Specific Factors 
1       x                 x                S1 
2       x                 x                S2 
3       x                 x                S3 
4       x                 x                S4 
5       x                 x                S5 
6       x                 x                S6 
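
The two-factor illustration above can be put in numbers. The following is a minimal sketch, assuming that test performance is a simple weighted sum of an individual's levels on Factor I and Factor II; the function name, the weights, and the ability levels are all hypothetical and serve only to mirror the two cases the text describes.

```python
def composite_score(levels, weights):
    """Weighted sum of an individual's factor levels (assumed linear model)."""
    return sum(weights[factor] * level for factor, level in levels.items())


# Case 1: equal weights. A balanced profile and a lopsided profile
# can reach the same score.
equal_weights = {"I": 0.5, "II": 0.5}
balanced = {"I": 60, "II": 60}
lopsided = {"I": 100, "II": 20}
score_balanced = composite_score(balanced, equal_weights)
score_lopsided = composite_score(lopsided, equal_weights)

# Case 2: Factor I carries much heavier weight, so strength on I
# matters more than the same strength on II.
heavy_on_I = {"I": 0.8, "II": 0.2}
strong_on_I = composite_score({"I": 90, "II": 30}, heavy_on_I)
strong_on_II = composite_score({"I": 30, "II": 90}, heavy_on_I)

print(score_balanced, score_lopsided)
print(strong_on_I, strong_on_II)
```

A regression equation of the kind named in objective (3) works in the opposite direction: it estimates the factor level from observed test scores, with weights derived from the factor analysis rather than assumed as they are here.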


Summary. Methods of factor analysis differ somewhat in their 
assumptions, and analysts differ somewhat in their interpretations of 
results, but the general conclusions derived by the several methods of 
analysis and interpretation do not differ radically. All factorial theories 
now postulate the presence of group factors, although the groups are 
not always identical and differ in relative emphasis placed upon them 
by different theories. Most theories also find a general factor neces- 
sary to explain intercorrelations, although here again emphasis upon 
the general factor varies. All agree that an individual’s mental activ- 
ities are attributable to the various ways in which the general and 
group factors combine in the performance of varied mental tasks. 
While several methods of factorial analysis are possible, basically the 
choice between them and interpretations derived through them must 
rest upon psychological theory and concepts rather than upon sta- 
tistical methods. Factors should not be regarded as fixed, predeter- 
mined mental entities. The factors that are found are influenced by 
the ages of the persons tested, by interests, by experience and 
training, and by the test items originally employed in the preliminary 
investigations. Factorial analysis is a statistical method which provides 
the means of improving test construction and of classifying test per- 
formance. 


ILLUSTRATIONS OF FACTORS 


The following illustrations will assist the student to grasp 
more fully the problem of factors 




since all the coefficients are far from perfect (+1.00), we are 
warranted in using all four to sample the testee’s mental abilities, 
rather than only one or two to the exclusion of the others. The fact 
that these four tests are not perfectly correlated (nor nearly so) 
might be due to one of these possibilities: (1) that each test samples 
the g factor in different amounts, plus its own specific factor; or 
(2) that the tests have g in common but each test samples also one or 
more group factors, though not necessarily the same ones; or (3) that 
each has many highly specific factors in common with every other one, 
as well as unique factors. A technical factorial analysis would go 
beyond this inspection analysis in an effort to determine which of 
these three hypotheses is the most plausible one, and to determine to 
what extent performance on each of the four tests calls upon whatever 
factors (g or others) might be inferred from the statistical analysis. 


TABLE 13A 

Intercorrelations of Four Subtests of the Wechsler 
Intelligence Scale for Children 2 

                 Vocab.   Info.   Sims.   Comp. 
Vocabulary         -       .74     .66     .60 
Information                 -      .67     .61 
Similarities                        -      .61 
Comprehension                               - 


TABLE 13B 

                 Obj. Assemb.   Comp.   Arith.   Digit Sp. 
Object Assembly       -          .13     .20       .13 
Comprehension                     -      .46    [illegible] 
Arithmetic                                -        .40 
Digit Span                                          - 

2 From the Manual. Psychological Corporation. 


Table 13B, by contrast, shows four subtests that have low 
intercorrelations. Of the six coefficients, only two (.40 and .46) are 
large enough to suggest that the subtests involved have much in common 
as regards psychological functioning. The coefficient of .46 between 
Comprehension and Arithmetic is attributable to the demands that 
both of these tests make upon reasoning ability, or, more specifically, 
ability to analyze a set of given material and then reorganize the 
elements toward the solution of the specified problem. The coefficient 


of .40 between Arithmetic and Digit Span is attributable, it appears 
from the characteristics of the two tests, to facility with numbers and 
ability in immediate recall (as contrasted with delayed recall). The 
remaining four coefficients are so low as to suggest that the tests con- 
cerned have little dependence upon common functions (whether g 
or other factors), and that each makes demands upon some factor or 
factors not called upon by the others. Here again, a factorial analysis 
would attempt to identify more precisely the factors involved; but in 
so doing the analyst would have to apply his knowledge of psycho- 
logical functioning to the items that are shown by the analysis to 
cluster together. 
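
One rough way to make the inspection analysis above concrete is to square each coefficient, which gives the proportion of variance two subtests share. The sketch below uses the six coefficients of Table 13A and the two coefficients of Table 13B that the text singles out; the function and variable names are illustrative assumptions, and squaring a correlation is a conventional aid, not a substitute for the factorial analysis the text describes.

```python
def shared_variance(r):
    """Proportion of variance two tests have in common (r squared)."""
    return r * r


# Coefficients as read from Table 13A.
table_13a = {
    ("Vocabulary", "Information"): 0.74,
    ("Vocabulary", "Similarities"): 0.66,
    ("Vocabulary", "Comprehension"): 0.60,
    ("Information", "Similarities"): 0.67,
    ("Information", "Comprehension"): 0.61,
    ("Similarities", "Comprehension"): 0.61,
}

# The two Table 13B coefficients discussed in the text.
table_13b_cited = {
    ("Comprehension", "Arithmetic"): 0.46,
    ("Arithmetic", "Digit Span"): 0.40,
}

mean_13a = sum(table_13a.values()) / len(table_13a)

for pair, r in {**table_13a, **table_13b_cited}.items():
    print(f"{pair[0]:>13} / {pair[1]:<13} r = {r:.2f}  "
          f"shared variance = {shared_variance(r):.2f}")
print(f"mean coefficient in Table 13A: {mean_13a:.2f}")
```

Even the highest coefficient in Table 13A leaves close to half of the variance unshared, which is the sense in which the coefficients are far from perfect.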

Figure 3.1 shows in graphic form how two and three tests might be 
interrelated. As the number of types of tests increases, the possible 
factor interrelationships may become more numerous and complex, 
though it is extremely improbable that no overlapping whatever of 
factors would be found in measuring human abilities by means of two 
or more different types of tests. In view of the more recent partial 
reconciliation of the group-factor and the general-factor theories, the 
illustrated overlappings are most probably attributable to g. 

The possible factor interrelationships of the parts of Figure 3.1 are: 


A. Each test is factorially independent of the other. The factor or 
factors in each are unique to it, either as “group” factors or as 
“specific” factors. 


B. The overlapping shaded area indicates a factor or factors common 
to both tests. This may be g or a group factor. The unshaded area 
indicates factors unique to each, either specific or group, or both. 
When a number of diverse tests show some overlapping among all of 
them, the soundest inference is that a g factor accounts for the 
common ground. 


C. In this instance the tests may be measuring only the general 
factor, or g plus the same group factor, or just the same group fac- 
tor. There are no unique factors. It is extremely improbable, in this 
situation, that the general factor is not involved. If numerous pairings 
of different tests showed this relationship, the soundest inference 
would be that a general factor is being measured. 


D. Each of the three tests is factorially independent of the other 
two. The uniqueness of each may be due to group or specific factors, 
or to both. 


E. The overlapping of 1 and 2 here may be attributable to g or to 
a group factor. Test 3 is independent of the others. The nonoverlapping 
segments of 1 and 2 may represent separate group factors or 
specific factors in each test. 


F. In this figure, overlapping between 1 and 2, and between 2 
and 3, is attributed to one or more group factors, but different ones 
in each case, since there is no common ground between all three tests. 
The unshaded areas would represent special factors or group factors, 
or both, which are not shared with either of the other two tests. 


G. Here there is some common ground in all three tests (shaded 
area), which is interpreted as showing the presence of the g factor. 
The dotted areas show group factors shared by only two of the tests. 
The unshaded areas represent either specific factors or unique group 
factors, or both. 


H. This figure represents three tests that have only the general 
factor in common. Both Tests 2 and 3 have the same amount and the 
identical area in common with Test 1; hence, they have the same 
amount and identical area in common with each other. 


FIG. 3.1. Possible Intercorrelations Between Two and Three Tests. 
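
The overlap patterns listed above can be mimicked by treating each test as a set of factor labels. This is a sketch under stated assumptions: the labels ("g", "A", "B", "s1", and so on) and the function name are hypothetical, and a set only records whether a factor is present, not how heavily a test is loaded with it.

```python
from itertools import combinations


def factor_overlap(tests):
    """Split the factors of several tests into shared and unique parts.

    tests maps a test name to the set of factor labels it draws upon.
    Returns (common_to_all, shared_by_pairs, unique_to_each).
    """
    common_to_all = set.intersection(*tests.values())
    shared_by_pairs = {}
    for a, b in combinations(sorted(tests), 2):
        # Factors two tests share beyond what all tests share.
        shared = (tests[a] & tests[b]) - common_to_all
        if shared:
            shared_by_pairs[(a, b)] = shared
    unique_to_each = {}
    for name, factors in tests.items():
        others = set().union(*(f for n, f in tests.items() if n != name))
        unique_to_each[name] = factors - others
    return common_to_all, shared_by_pairs, unique_to_each


# Pattern G: common ground in all three tests plus pairwise group factors.
pattern_g = {
    "Test 1": {"g", "A", "s1"},
    "Test 2": {"g", "A", "B", "s2"},
    "Test 3": {"g", "B", "s3"},
}
# Pattern F: pairwise overlap only, with no ground common to all three.
pattern_f = {
    "Test 1": {"A", "s1"},
    "Test 2": {"A", "B", "s2"},
    "Test 3": {"B", "s3"},
}

common_g, pairs_g, unique_g = factor_overlap(pattern_g)
common_f, pairs_f, unique_f = factor_overlap(pattern_f)
print("G:", common_g, pairs_g, unique_g)
print("F:", common_f, pairs_f, unique_f)
```

In pattern G the set common to all three tests is non-empty, which corresponds to the shaded area read as g; in pattern F it is empty, so the overlaps can only be group factors.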




These graphic illustrations of correlations and factor “loadings” 
derived from statistical analysis serve three purposes: (1) They 
demonstrate the complexity of the problem of determining 
interrelationships of psychological factors. (2) They demonstrate that 
the same statistical findings are often open to more than one 
psychological interpretation; and, using the statistical findings as 
aids, one’s interpretation will depend basically upon his psychological 
analyses of intellectual functioning. (3) The illustrations and their 
several possible interpretations help to make clear the reasons why 
the most valid and useful tests within a given category (e.g., 
intelligence) have much in common as regards psychological functioning 
and as regards test items themselves. 


FIG. 3.2. Diagrams of the Component Variances of Three 
Army Air Force Classification Tests. (From Guilford, 
op. cit., p. 86.) Panels: Reading Comprehension; Dial and 
Table Reading; Discrimination Reaction Time. The letters 
stand for: 
V    verbal-comprehension factor 
ME   mechanical-experience factor 
R1   reasoning I (general-reasoning) factor 
R2   reasoning II (common to analogies tests) factor 
Vz   visualization factor 
O    other common factors, each with variance too 
     small to mention separately 
U    unknown common-factor or specific-factor variances 
E    error variances 
N    numerical factor 
S1   space I (spatial-relations) factor 
P    perceptual-speed factor 
MB   mathematical-background factor 
M2   memory II (visual-memory) factor 
PM2  psychomotor II (precision) factor 

Finally, Figures 3.2 and 3.3 illustrate elaborate factorial analyses 
of tests which have been statistically fractionated.** These indicate 
the probable quantitative portions of each of the several factors in 
each of the tests. Such analyses do, undoubtedly, provide insights into 
the psychological operations that combine in performance on the tests. 
Test construction is thereby facilitated. It should not be assumed, 
however, that each of these factors exists or operates independently. 
We may look at the factors in the Reading Comprehension test as an 
example. We note that “verbal comprehension” is the largest single 
factor; then we have, in order, “mechanical experience,” “reasoning 
I” and “reasoning II.” It is very doubtful that these four factors can 
or should be separated functionally. Mechanical experience will have 
been significant in the comprehension of the verbal materials of this 
test and will have contributed to the verbal competence of the testee 
in the particular area covered by the test. Reasoning, of whatever 
kind, is basically a matter of problem-solving ability, whether with 
the use of concrete objects or with abstractions (words and numbers). 
In this test of Reading Comprehension, the examinee’s ability to 
reason with the problems presented will be dependent in part upon his 
mechanical experience (contributing to his comprehension) and upon 
his knowledge of the terms used in the problems. Conversely, the 
extent and quality of the vocabulary he acquires and the degree to 
which he benefits from his mechanical experience will depend in part 
upon his present and potential reasoning capacity. 


FIG. 3.3. Diagrams of the Component Variances of Pilot 
and Navigator Training Criteria. (From Guilford, op. cit., 
p. 86.) Panels: Pilot Criterion; Navigator Criterion. Letter 
symbols as defined with Figure 3.2, except for some 
additional ones: 
PI   pilot-interest factor 
M4   memory IV (content-memory) factor 
M3   memory III (picture symbol association) factor 
PM1  psychomotor I (coordination) factor 
LE   length-estimation factor 


** J. P. Guilford, “Factorial Analysis in a Test-Development Program,” 
Psychological Review, Vol. 55, 1948, pp. 79-94. 
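
The component-variance diagrams rest on a simple piece of arithmetic: with uncorrelated factors, the square of each loading is the share of the test's variance attributable to that factor, and whatever the common factors leave over is divided between specific and error variance. The sketch below assumes that conventional decomposition; the loadings and the reliability are invented for illustration and are not Guilford's values.

```python
def variance_components(loadings, reliability):
    """Partition a test's unit variance (assumes uncorrelated factors).

    loadings: factor label -> loading of the test on that common factor.
    reliability: proportion of non-error variance in the test.
    Returns (common_by_factor, communality, specific, error).
    """
    common_by_factor = {name: value ** 2 for name, value in loadings.items()}
    communality = sum(common_by_factor.values())
    specific = reliability - communality
    error = 1.0 - reliability
    if specific < 0:
        raise ValueError("communality cannot exceed reliability")
    return common_by_factor, communality, specific, error


# Hypothetical loadings for a reading-comprehension type of test.
hypothetical_loadings = {"V": 0.60, "ME": 0.40, "R1": 0.30, "R2": 0.20}
common, communality, specific, error = variance_components(
    hypothetical_loadings, reliability=0.85
)

for name, share in common.items():
    print(f"{name:>3}: {share:.2f}")
print(f"communality {communality:.2f}, specific {specific:.2f}, error {error:.2f}")
```

The proportions describe how variance is apportioned statistically; as the text cautions, they do not show that the factors operate independently in the examinee.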


IMPLICATIONS 


The hypotheses as to the nature of mental abilities have been 
arrived at by means of several methods of statistical analysis and 
through partially different interpretations placed upon much the same 
data by different investigators. Regardless of which of the hypotheses 
an author of a test follows, the instrument he develops will have very 
much in common with those constructed by authors who base their 
tests on one of the other hypotheses. In many respects, the processes 
of standardization will be the same; the same basic principles of con- 
structing and testing will have to be observed. A variety of mental 
activities will have to be sampled; in the case of the multi-factor the- 
ory, in order to sample an adequate and representative number of the 
many minute factors; in the case of the group-factor theory, in order 
to sample the “primary” abilities and those “second-order” factors 
that might be found subsequently; in the case of the two-factor the- 
ory, in order to get an adequate sampling of the general factor. 

The main practical differences arising from the theoretical differ- 
ences will be found in the tests based on the group-factor theory, as 
compared with others. The differences will be these: the parts of the 
test based on group-factor theory must correspond with the factors 
or “primaries” and they must try to measure these factors in as “pure” 
a form as possible; the subtests in a scale based upon group-factor 
theory should have low intercorrelations; the test based on group- 
factor theory will emphasize the separate scores on each of the “pri- 
maries” and will provide a “mental profile,” even though the group- 
factor test might also provide an over-all index. The Binet type of 
test, and other g-factor tests, on the other hand, consists of a variety 
of test materials which are a composite of abilities, yielding a mental 




age and an intelligence quotient. Most group tests,** while arranging 
their items according to type (sentence completion, arithmetical rea- 
soning, word meaning, picture completion, form perception, etc.), 
are not organized on the basis of specifically defined factors; and, 
like the Binet type, they generally yield a single index of relative rank. 

Since the several hypotheses regarding the nature of intelligence 
have thus far produced relatively few differences in practical test con- 
struction and application, the reader might well ask: “Why, then, be 
so concerned with definitions and theories, when the end-results are 
not radically different?” The answer to this question has several 
aspects. First, the student should be familiar with the thinking of 
psychologists in this field, as a background for his better understand- 
ing of the tests themselves. Second, it is through the interaction of the 
theoretical and applied that improvements and advances will be made. 
Third, it is possible that one or more of these theories will have 
increasing influence, in the future, upon test construction, testing 
practice, and test interpretation.*

With these definitions and theories of intelligence in mind, we turn 
now to an examination and evaluation of some of the representative 
current tests of intelligence. 


** Excepting the “omnibus” type, in which items of various mental 
operations are placed in regular or irregular order, instead of being 
grouped in subtests, each containing items of a single kind. 
* A useful presentation of some problems of test construction will be 
found in J. Loevinger, “A Systematic Approach to the Construction and 
Evaluation of Tests of Ability,” Psychological Monographs, Vol. 61, 
No. 4, 1947. See also L. L. Thurstone, “Psychological Implications of 
Factor Analysis,” The American Psychologist, Vol. 3, 1948, pp. 402-408. 


4 


THE BINET SCALES 


HISTORICAL BACKGROUND 


The development of mental tests was motivated by psychologists’ 
interests in and attempts to measure individual differences in 
abilities. Experimental work in testing was begun in the 1860s by 
Francis Galton, the English geneticist and eugenist who wished to 
study the effects of heredity and environment upon the development 
of mental abilities, and thus the possibilities of “supplanting 
inefficient human stock by better strains.” It was necessary, 
therefore, that he develop methods of testing psychological traits 
which would be more objective and valid than any means in 
[illegible]. It was to be expected (in view of Galton’s background as 
a biologist, the ascendance of biological science during this period, 
and the emergence of experimental psychology as a separate 
discipline) that Galton should attempt to measure intelligence [by] 
tests of imagery and sensory discrimination which he devised. For 
example, he used a test for the measurement of [illegible] 
discrimination, and which is known as the “Galton [illegible].” 
Although it has since been demonstrated that these measures have 
very little or no value for evaluation of the higher and more complex 
processes which we call “mental,” Galton’s work did, nevertheless, 
[illegible] of test experimentation until about 1900, when the 
influence of Alfred Binet, the French psychologist, was [felt]. 

The kinds of psychological tests which were being used prior to 
1900 are well represented by those with which J. McKeen Cattell was 
experimenting in the United States.1 Ten of his most highly regarded 
tests at that time were the following: 

(1) Strength of grip, using the dynamometer 

(2) Rate of movement; the quickest time in which the hand can 
be moved through a distance of fifty centimeters 

(3) The smallest perceptible distance between two points on the 
skin; known as “two-point discrimination” 

(4) Amount of pressure necessary to cause pain by exerting 
pressure upon the forehead with a strip of hard rubber 

(5) The smallest difference in weight that can be discerned, 
measured by requiring that two weights be lifted in succession 

(6) Speed with which an individual can react to a sound 

(7) Speed with which an individual can name ten specimens of 
four different colors arranged in haphazard order 

(8) The accuracy with which an individual can bisect a 
fifty-centimeter line 

(9) The accuracy with which an individual can reproduce an 
interval of ten seconds 

(10) Immediate rote memory, using a series of consonants 

It is obvious that these are, for the most part, simple sensory and 
motor tests, with [a test of] rote memory added. 

Other investigators in the United States and abroad were also 
experimenting with psychological tests, employing methods and 
materials similar to those of Galton and Cattell. Jastrow gave tests 
involving both touch and vision, tests of vision alone, tests of memory 
and tests of reaction time. Boas made physical measurements of 
children and also tested their vision, hearing, and memory. At the 
same time he obtained teachers’ estimates of their pupils’ intellectual 
[acute]ness. Gilbert used measures of height, weight, and lung 
capacity; tests of sensation; rapidity of tapping, reaction time, 
memory, and suggestibility. He also obtained teachers’ ratings of the 
pupils’ mentality. 

1 J. McK. Cattell, “Mental Tests and Measurements,” Mind, Vol. 15, 
1890, pp. 373-81. 

Several other investigators during this period were beginning to 
introduce somewhat more complex materials and methods. Kraepelin 
and Oehrn, in Germany, used tests of perception involving the count- 
ing of letters, cancellation of letters, and detection of errors on the 
printed page; tests of memory involving digits and nonsense syllables; 
tests of association and of motor functions. 

Muensterberg also devised tests of a more complex kind: reading 
aloud rapidly; giving rapidly the colors of named objects; rapidly nam- 
ing and classifying animals, plants, and minerals; naming rapidly and 
classifying cloth, food, and parts of the body; naming rapidly ten 
simple designs and ten squares of colors; tests of addition; rapidly 
counting angles in irregular polygons; naming different odors. He also 
employed tests of memory for digits and letters after a single presenta- 
tion; bisecting, judging, and reproducing lengths of lines; locating a 
sound; constructing a square and an equilateral triangle, with only the 
base of each given. 

Some of Muensterberg’s tests, it will be noted, were more complex 
and more varied than those of other experimenters; yet, like those of 
others, his were tests essentially of simple psychological processes, 
with a premium placed upon speed. 

These relatively simple tests of sensory, motor, and memory capac- 
ities proved to be of very little value as measures to reveal intelligence. 


In the first place, their intercorrelations were very low, ranging 
generally from zero to only .20. And, in the second place, the results 
of these tests, when correlated with academic performance, yielded 
correlation coefficients of much the same magnitude, many of them 
being less than .10, hence useless for purposes of prediction. As a 
matter of fact, experimentation in the years that followed the early 
investigations has consistently confirmed the negligible or very low 
correlations found to exist between sensory and motor capacities, on 
the one hand, and the higher, more complex functions, called 
“intelligence,” on the other. 

It is now generally recognized by psychologists that intelligence has 
little relationship to the elementary sensory and motor processes, and 
but a very moderate relationship indeed to capacity for rote memory 
(a correlation of about .30). Many infra-human animals have keen 
sensory discrimination. Mentally deficient children in the higher levels 


of defect and children in the “borderline” group are not very inferior 
to normal children in respect to skin sensitivity, visual acuity, auditory 
acuity, reaction time, etc. Nor are intellectually gifted children su- 
perior to average in these respects. But in the capacities to learn, to 
organize and direct thinking, to adapt behavior, to comprehend prob- 
lems and deal with abstractions, in levels of information possessed, in 
extent of curiosity about one’s environment, these groups do differ 
very markedly indeed. 

The reader should bear in mind these early kinds of tests, not only 
for historical purposes, but also in order to compare the early efforts 
with currently available tests and to be more clearly aware of the 
direction in which psychological testing has been moving. 
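
The earlier remark that coefficients of .10 or .20 are useless for prediction can be checked with a conventional statistic, the index of forecasting efficiency, 1 - sqrt(1 - r^2), which expresses how far a correlation reduces errors of prediction below those of a bare guess at the mean. The sketch below is an added illustration using that standard formula; it is not a computation taken from the text, and the function name is an assumption.

```python
import math


def forecasting_efficiency(r):
    """Proportional reduction in prediction error achieved by correlation r."""
    return 1.0 - math.sqrt(1.0 - r * r)


# Coefficients of the sizes mentioned in the text.
for r in (0.10, 0.20, 0.30):
    print(f"r = {r:.2f}: errors of prediction reduced by "
          f"{100 * forecasting_efficiency(r):.1f} per cent")
```

On this index a coefficient of .10 improves prediction by well under one per cent, which accords with the judgment that the early sensory and motor tests had no predictive value.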


THE EARLY WORK OF ALFRED BINET 


The earlier work of Binet, the French psychologist, was along 
much the same lines as that of the American and German psycholo- 
gists already mentioned. He used tests of tactual discrimination, re- 
action time, visual discrimination, auditory discrimination of time in- 
tervals, reproducing letters and numbers from memory, and so on. 
But though he experimented with these materials until about 1900, 
he had begun to doubt, some years earlier, the value of continuing 
with them. 

Although some of the mental activities which Binet proposed to 
measure and with which he was experimenting were as yet vague, he 
did nevertheless point out the direction in which mental tests should 
and in fact did develop; that is, that the higher and more complex 
mental functions must be measured rather than the simpler sensory 
and motor processes. 

Binet and his collaborators objected to the kinds of psychological 
tests which followed Galton’s work, on the ground that they were too 
simple in character and would contribute little to the understanding 
of differences among persons in respect to the higher mental 
functions. Binet maintained, furthermore, that intelligence is expressed 
not in the form of simple, segmental responses, but rather as a 
combined mental operation wherein whatever processes are involved 
operate as a unified whole. 

It is in these higher functions that individual differences are most 
marked: it is these which distinguish individuals most significantly 
and characteristically in daily activity, whereas it is in the simpler 


sensory and motor processes that persons differ least significantly. 
Binet was quite ready to grant that the simpler processes, such as 
those already mentioned, lent themselves to more precise measurement 
and, therefore, yielded more constant results. Yet his interests 
were strongest in individuals rather than in sensory and motor 
processes as such. Thus, he was ready to sacrifice the greater 
quantitative precision of sensory-motor tests in order to obtain more 
valid evidence of the integrated mentality of the individual; for he 
argued that in the measurement of the higher functions, the greatest 
precision of measurement, though desirable, was not as essential as in 
measuring the simpler processes, because of the very fact that 
individuals differ much more markedly in the former. 


FIG. 4.1. Alfred Binet (1857-1911). 


Binet emphasized that his proposed scale, testing the higher and 
more complex mental functions, would not measure in a physical 
sense, in the same way, for example, that the length of a line is 
measured. His tests would, however, yield “a classification, a hierarchy 
among diverse intelligences; and for the necessities of practice this 
classification is equivalent to a measure.” 2 

2 A. Binet, The Development of Intelligence in Children, trans. by 
E. S. Kite, Vineland Training School, 1916, p. 40. 

Binet and his collaborators were interested in establishing the 
extent and nature of variations of mental functions from one individual 
to another, and in the determination of the interrelations of the 
several functions within the individual. In 1896, therefore, Binet and 
Henri (a collaborator) published their studies of the following 
functions: memory, the nature of mental images, imagination, attention, 
comprehension, suggestibility, esthetic appreciation, moral sentiments, 
muscular strength, strength of will, motor skill, and visual judgment. 
These, they believed, were functions “. . . which differ much from 
one individual to another, and are such that knowledge of their state 
for an individual gives us a general idea of this person, permits us to 
distinguish him from other individuals belonging to the same milieu.” 3 
Here, then, we have the beginnings of the tests which a few years later 
proved so useful in the construction of the Binet scales and 
subsequently in the several revisions developed for use in the United 
States. 

In 1904 a practical situation arose in which Binet had an 
opportunity to apply his principles with regard to differentiating 
individuals and to make an outstanding contribution to the study of 
mental abilities and individual differences. The French Minister of 
Public Instruction appointed a commission to recommend measures to be 
taken in the education of mentally subnormal children in the schools 
of Paris, for it was recognized that these children were unable to 
profit from the usual instruction. The plan was, therefore, to 
eliminate subnormal children from ordinary schools and to provide 
adapted instruction in a special school. Admission to the special 
school was to be based upon a medical and a pedagogical examination. 
Obviously, the first device needed was some objective means 
of selecting pupils of subnormal mentality. Subjective opinions were 
worse than useless; for not only was there absence of agreement 
among different so-called “experts,” but serious injustices could 
result in numerous cases. 


THE 1905 BINET-SIMON SCALE 

It was to meet this problem that Binet constructed his first 
scale to evaluate children’s intelligence levels. This instrument is 
known as the 1905 Binet-Simon Scale.4 In it we find a fundamental 
conception underlying all tests by means of which mental abilities 
of children are measured. The principle is that we may identify 
differences in mentality, differences in degrees of brightness or 
dullness, with differences in levels of development as represented by 
the average capacities of children of various ages. Thus, if we know 
the levels of intellectual performance of typical, or normal, children 
at each age, we can determine in the case of any individual child the 
extent to which his mental development is accelerated or retarded, or 
whether it is just about at the average level for his age at any given 
time. 

In the 1905 scale, this conception was only crudely implemented; 
but in time it became more precise and has since taken the form of 
indexes already discussed in Chapter 2: namely, mental age, 
percentile ranks, decile ranks, standard scores, intelligence quotients, 
and others. 

3 For details, see A. Binet and V. Henri, “La Psychologie Individuelle,” 
L’Année Psychologique, Vol. 2, 1896, pp. 411-465. For most of Binet’s 
contributions, see L’Année Psychologique, Vols. 1-17. 
4 [In a scale,] the test items are arranged in order of increasing 
difficulty. 


The thirty items, in order of increasing difficulty, comprising the 
1905 scale follow: 5 


(1) Visual coordination—degree of coordination of movement 
of head and eyes as a lighted match is slowly moved before subject’s 
eyes 

(2) Prehension provoked by tactual stimulus—a small cube of 
wood is placed on back or palm of the subject's hand to see if he 
grasps it and carries it to his mouth, and coordination of movements 
is to be noted 


(3) Prehension provoked visually: cube of wood is placed 
within subject’s reach by examiner who notes whether subject 
grasps it 

(4) Recognition of food: a small piece of chocolate and a piece 
of wood of same dimensions are shown successively, and signs of 
recognition of food and efforts to take it are noted 

(5) Seeking food when slight mechanical difficulty is interposed: 
a small piece of chocolate, wrapped in a piece of paper, is given 
to the subject, and his manner of separating the food from the paper 
is noted 


(6) Execution of simple directions and imitations of simple ges- 
tures 


(7) Verbal knowledge of objects: parts of the body (head, ear, 
nose, etc.) are indicated by the subject, and common objects (key, 
string, cup) are handed to examiner on request 

(8) Verbal knowledge of objects in a picture, as shown by 
pointing out objects, the names of which are given 


5 For a detailed description, see G. M. Whipple, Manual of Mental and 
Physical Tests, Baltimore: Warwick and York, 1910, pp. 475 ff. 


(9) Naming of objects designated in a picture 

(10) Comparison of lengths of two straight lines, pointing out the 
longer 

(11) Repeating three digits immediately after hearing the series 
once 

(12) Comparison of weights; identical-appearing blocks of wood 
weighing 3 and 12 grams, 6 and 15 grams, 3 and 15 grams 

(13) Suggestibility—asking the subject for an object that is not 
present (modification of 7); asking subject to point to a nonexistent 
object in a picture, designated by a nonsensical word (modification 
of 8); comparison of lines of equal length (modification of 10) 

(14) Definitions of familiar objects: such as, house, horse, fork 

(15) Repetition of sentences having fifteen words each, after
hearing each one only once

(16) Giving the differences between two common objects: e.g.,
wood and glass, a fly and a butterfly

(17) Immediate recall of pictures of familiar objects—pictures of 
thirteen common objects are shown for thirty seconds, after which 
the subject names as many as he can recall 

(18) Drawing from memory two different geometric designs 
which have been shown simultaneously for ten seconds 

(19) Repetition of series of digits, beginning with a series of three
and proceeding until the subject's limit is reached

(20) Giving resemblance between common objects: e.g., a wild
poppy and blood; an ant, a fly, a butterfly, and a flea

(21) Rapid comparison of lengths of lines: a line of 30 cm. is
compared with fifteen others varying from 31 to 35 cm.; then a line
of 100 cm. is compared with twelve others varying from 101 to
103 cm.

(22) Discriminating and arranging in order five weights—3, 6, 9,
12, 15 grams—all being of equal size

(23) Recall of weights—one of the weights in test 22 is removed,
the remaining weights are scrambled, and the subject is asked to
identify the missing weight or gap in the series

(24) Giving rhymes to selected words

(25) Sentence completion—supplying the correct word to com-
plete a sentence
(26) Devising a sentence to include three given words: e.g., Paris,
gutter, fortune

(27) Comprehending and giving replies to twenty-five problem-
questions graded in difficulty: e.g., What is the thing to do when you
are sleepy? Why is it better to continue with perseverance what one
has started than to abandon it and start something else?

(28) Reversing the hands of a clock, to be done from memory;
e.g., giving the time it would be if the large and the small hands were
interchanged at four minutes to three. The subjects who succeed
were given the more difficult problem of explaining why the precise
transposition indicated is impossible.




(29) Drawing lines to show the folds and cut-out of a piece of
paper that has been quarto-folded and from which a triangular piece
has been cut

(30) Giving definitions and distinctions between paired abstract 
terms: e.g., sad and bored 


Although this set of tests was not separated into age groups, Binet 
did indicate several differentiating levels. Question number 6 was 
the upper limit of idiots (adult); question 9 was the upper limit of 
ordinary three-year-old children; number 14 was the limit of ordi- 
nary five-year-old children; number 16, that of imbeciles (adult); 
test 23, the most probable limit of morons (adult), although test 27 
was regarded as having great value in revealing the moron. In addi-
tion, the authors reported a number of qualitative and quantitative
differences in replies to many of the questions, thus distinguishing 
between 7- and 9-year levels, on the one hand, and 9- and 11-year 
levels, on the other. 

The order of tests in the 1905 scale was experimentally deter- 
mined, for it was established after being used with children in the 
primary schools and in an institution for the mentally deficient (the 
Salpêtrière). The children in primary school were regarded as “normal”
on the basis of the fact that they were in grades just right for
their ages—neither advanced nor retarded. Binet and Simon report
that many such children were tested, but norms for the scale were
based upon records of only ten cases in each of the following age-
groups: 3, 5, 7, 9, and 11 years.

While admittedly rather crude and tentative, this first scale en-
abled Binet and Simon to classify idiots, imbeciles, and morons in a
more objective manner than had been possible before.

Furthermore, in the foregoing list of thirty items, the reader will
find many types which have since been developed, standardized, and
included in a large number of current psychological tests, from those
designed for babies to those intended for adult levels.

It is significant to note, also, that while Binet wanted to devise a
scale that would yield age ratings, he was equally concerned with the
quality of judgment and reasoning shown by the subject in the course
of the examination. Binet was thus using the test situation as an op-
portunity for a clinical interview—a practice which is becoming in-
creasingly widespread and of increasing importance in reports of
psychological examinations by present-day clinical psychologists.




THE 1908 BINET-SIMON SCALE 


Binet and Simon recognized the defects of the first scale. They 
recognized that an improved scale would have to provide more valid 
norms, based upon a larger and more representative sampling of 
children at each age; that tests for each age within the limits of the 
scale would have to be included to achieve finer units of measure- 
ment and greater accuracy. Their own subsequent investigations and 
those of other psychologists resulted in a new form of the test, known 
as the 1908 scale, in which the items are grouped at the appropriate 
age-levels, from 3 years to 13 years.⁶

Age 3
(1) Points to nose, eyes, mouth
(2) Repeats sentences of six syllables
(3) Repeats two digits
(4) Enumerates objects in a picture
(5) Gives family name


Age 4
(1) Knows own sex
(2) Names certain familiar objects shown to him (key, knife,
penny)
(3) Repeats three digits
(4) Perceives which is the longer of two lines 5 and 6 cm. in
length


Age 5
(1) Indicates the heavier of two cubes (3 and 12 grams; 6 and 15
grams)
(2) Copies a square
(3) Constructs a rectangle from two triangular pieces of card-
board, having a model to look at
(4) Counts four coins
(5) Repeats a sentence of ten syllables


Age 6
(1) Knows right and left; indicated by showing right hand and
left ear
(2) Repeats sentence of sixteen syllables


⁶ A. Binet and Th. Simon, “Le développement de l’intelligence chez les enfants,”
L’Année Psychologique, Vol. 14, 1908, pp. 1-94. See also the translation of
Binet’s publications by E. S. Kite, The Development of Intelligence in Children,
Baltimore: Warwick and York, 1916.




(3) Chooses the prettier in each of three pairs of faces (esthetic 
comparison) 


(4) Defines familiar objects in terms of use 
(5) Executes three commissions 

(6) Knows own age 

(7) Knows morning and afternoon 


Age 7 

(1) Perceives what is missing in unfinished pictures 

(2) Knows number of fingers on each hand and on both hands 
without counting 

(3) Copies a written model (“The little Paul”) 

(4) Copies a diamond 

(5) Describes presented pictures 

(6) Repeats five digits 

(7) Counts thirteen coins 

(8) Identifies by name four common coins 


Age 8 
(1) Reads a passage and remembers two items
(2) Adds up the value of five coins 
(3) Names four colors: red, yellow, blue, green 
(4) Counts backwards from 20 to 0 
(5) Writes short sentence from dictation 
(6) Gives differences between two objects 


Age 9 
(1) Knows the date: day of week, day of month, month of year
(2) Recites days of week
(3) Makes change: four cents out of twenty in play-store trans-
action
(4) Gives definitions which are superior to use; familiar objects
are employed
(5) Reads a passage and remembers six items

(6) Arranges five equal-appearing cubes in order of weight 


Age 10 
(1) Names the months of the year in correct order 
(2) Recognizes and names nine coins 
(3) Constructs a sentence in which three given words are used
(Paris, fortune, gutter)
(4) Comprehends and answers easy questions
(5) Comprehends and answers difficult questions
(Binet considered item 5 to be a transitional question be-
tween ages 10 and 11. Only about one half of the ten-year-
olds got the majority of these correct.)




Age 11 
(1) Points out absurdities in statements 
(2) Constructs a sentence, including three given words (same as 
number 3 in age 10) 
(3) Gives any sixty words in three minutes 
(4) Defines abstract words (charity, justice, kindness) 
(5) Arranges scrambled words into a meaningful sentence 


Age 12 
(1) Repeats seven digits 
(2) Gives three rhymes to a word (in one minute) 
(3) Repeats a sentence of twenty-six syllables 


(4) Answers problem questions 
(5) Interprets pictures (as contrasted with simple description) 


Age 13 
(1) Draws the design made by cutting a triangular piece from the
once-folded edge of a quarto-folded piece of paper
(2) Rearranges in imagination the relationship of two reversed tri-
angles and draws results
(3) Gives differences between pair of abstract terms: pride and
pretension.


There are several obvious differences between the 1905 scale and
that of 1908. In the former, there are thirty test items; in the latter,
fifty-nine. The latter does not include the first six items of the 1905
scale which are at the infant level; some other items of the 1905 scale
have been eliminated, and many new ones have been added. As com-
pared with the 1905 scale, the age-range extends higher in the 1908
scale. There are specific groups of items for each age (thus permitting
a more accurate rating of individuals), and a greater variety of mental
processes is tested.

In the 1908 scale, there are also two new and very significant con-
tributions to the theory and practice of mental testing and test con-
struction: (1) the tests, after experimentation, were standardized by
being grouped into appropriate age-levels (Binet's method is ex-
plained below); (2) the concept of mental age is employed for the
first time.⁷

⁷ Although mental age is employed here for the first time, the concept itself
had been arrived at by Binet in 1905.

The principal criterion employed by Binet and Simon in the stand-
ardization and age-placement of tests was this: in general, a test was
placed at the year-level where it was passed satisfactorily by from
two thirds to three fourths of a representative group of children of 
that age. The ideal standard was to place a test at a year-level where 
it was passed by seventy-five percent of that age-group. Binet’s reason 
for setting this ideal criterion is a sound one and is made clear by 
reference to a symmetrical bell-shaped curve, which is approximated 
by most distributions of intelligence-test scores. The middle 50 per- 
cent of the group are most nearly alike, most nearly homogeneous in 
respect to the abilities being measured, as is obvious from the con- 
centration of these 50 percent within a relatively narrow range, or 
variation, of scores. Otherwise stated, those individuals constituting 
the middle 50 percent of the distribution are the typical persons of 
the age group; hence, their test performance should be regarded as 
typical or normal for their age. If the middle 50 percent of a given 
age are able to pass a test, then that same test can be passed by the 
25 percent who are above the middle group in ability, making a total
of 75 percent who are able to pass the test.
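
The placement rule just described can be sketched in a few lines of Python. This is an illustration only: the function name `place_test`, the choice of two thirds as the default cutoff, and the pass percentages are assumptions made for the example, not figures or procedures reported by Binet and Simon.

```python
from fractions import Fraction


def place_test(percent_passing, threshold=Fraction(2, 3)):
    """Return the youngest age at which the share passing reaches the threshold.

    percent_passing maps age in years to the whole-number percent of
    children of that age who passed the test. The default of two thirds
    is the lower bound of the range given in the text; three fourths
    was the stated ideal.
    """
    for age in sorted(percent_passing):
        if Fraction(percent_passing[age], 100) >= threshold:
            return age
    return None  # no age group reached the threshold


# Hypothetical pass percentages for a single test item (illustrative only).
item = {5: 30, 6: 48, 7: 72, 8: 90}

print(place_test(item))                  # placement under the two-thirds bound
print(place_test(item, Fraction(3, 4)))  # placement under the 75-percent ideal

# Arithmetic behind the 75-percent ideal: the middle 50 percent of the
# age group plus the 25 percent who stand above them.
print(50 + 25)
```

With these made-up figures the item falls at a different year-level depending on which end of the two-thirds to three-fourths range is applied, which suggests why an exact 75-percent standard was hard to satisfy in practice.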

In actual experience, however, it has been practically impossible 
to devise tests that will exactly satisfy this criterion of 75-percent 
passing. Fortunately, there are other criteria of validity which are 
also of primary significance, so that tests are retained if they approxi- 
mate the 75-percent criterion, and if they demonstrate their value 
by also satisfying other demands, such as distinguishing between 
groups of individuals of known ability (mentally deficient, average, 
and superior), showing appreciable or significant differences between 
percentages passing at successive age levels, and correlating fairly 
well with scholastic achievement. These aspects of validation will be 
more fully discussed in the following chapter, in connection with re- 
visions of the Binet scale. 

Binet and Simon standardized their 1908 scale after individual 
examinations of 203 Paris school children between the ages of 3 and 
13 years. While this number is small and would be regarded as in- 
adequate in present-day test-standardization procedures, the fact is 
that these French pioneers did set a pattern of standardization which 
is being followed today, with the use of considerable statistical re- 
finement. For in addition to having suggested the criteria already 
mentioned, they also, in effect, used the symmetrical bell-shaped 
curve as a criterion, though without offering precise numerical val- 
ues. They stated, simply, that the number of children testing above 




age (superior) should equal the number testing below age (inferior), 
and the number testing at age, or normal, should be greater than the 
number who rank as either superior or inferior. 

The mental age, with the 1908 scale, was found this way: the sub- 
ject was credited with the age level at which he passed all the tests. 
To this basic level (now called the “basal year”) an additional year’s 
credit was added for every five tests passed at higher levels. The total 
was the subject’s mental age. No credits were given for a fraction of
a year; but in the 1911 scale (see below) the calculation of mental
age was modified so as to include fractional parts. The reader will
note that this method of deriving mental age is essentially the same
as that used with the American age-scale revisions.
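
As a minimal sketch of the 1908 scoring rule stated above, the following Python function adds one whole year to the basal year for every five tests passed at higher levels and discards any remainder. The function name and the example figures are hypothetical; only the rule itself comes from the text.

```python
def mental_age_1908(basal_year, tests_passed_above_basal):
    """Mental age under the 1908 rule.

    basal_year is the age level at which the subject passed all tests.
    One full year is credited for every five tests passed at higher
    levels; fractions of a year are not credited.
    """
    return basal_year + tests_passed_above_basal // 5


# Hypothetical child: all tests passed at age 7, six tests passed above it.
print(mental_age_1908(7, 6))
```

Integer division reflects the statement that no credit was given for a fraction of a year: six tests above the basal year earn the same single year of credit as five.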

In spite of its imperfect standardization, in the 1908 Binet-Simon 
Scale and in the publications concerned with it will be found many 
of the important concepts and practices which have been employed 
for about forty years in the construction and use of psychological
tests.


THE 1911 REVISION OF THE BINET SCALE


The 1908 scale created considerable interest among psycholo-
gists in Belgium, Germany, England, Italy, Switzerland, and the
United States. Their interest resulted in a number of valuable ap-
plications and evaluations of the Binet scale, accompanied by sug-
gestions for revisions.

For the most part, criticisms and suggestions dealt with the age
levels at which various items had been placed. It is not surprising
that, in the first age-scale devised to measure intelligence, further
and extensive applications and analysis of results should have re-
vealed that a number of the items were misplaced. The principal
criticism was that the tests at the lower age-levels were too easy,
whereas those at the higher levels were too difficult, with the result
that younger children were rated too high, while older ones were rated
too low. In other words, standardization of the test had to be im-
proved. Binet utilized the suggestions and criticisms of other psy-
chologists, as well as the results of his own continued researches on
the 1908 scale, the result being the 1911 revision.

Specifically, the major changes incorporated in the 1911 scale were
the following: four of the tests at the 11-year level were raised to the
12-year level; all 12-year tests were raised to the 15-year level; the




three tests of year 13, plus two new ones, constituted the new adult 
level. Here and there, also, a few tests were placed in either a higher 
or a lower age-level. No tests were provided for the 11-, 13-, and 14- 
year levels.⁸ In addition to these changes, a fair number of tests found
in the 1908 scale were omitted from the 1911 scale because they 
seemed to depend too much on school learning or on very incidental 
information. 


At age-levels 3, 4, and 5 the tests are the same as in the 1908 ver-
sion.


Age 6 

(1) Distinguishes between morning and afternoon 

(2) Defines names of familiar objects in terms of use 

(3) Copies a diamond 

(4) Counts thirteen sous 

(5) Distinguishes between pictures of ugly and pretty faces 
Age 7 

(1) Shows right hand and left ear 

(2) Gives description of pictures 

(3) Executes three commissions given simultaneously 

(4) Gives value of 3 single- and 3 double-sous 

(5) Names four colors: red, green, yellow, blue 


Age 8 
(1) Gives differences between two objects (from memory)
(2) Counts backwards from 20 to 0
(3) States omissions from unfinished pictures
(4) Knows the date
(5) Repeats five digits


Age 9 
(1) Makes change from 20 sous
(2) Defines names of familiar objects in terms superior to use
(3) Recognizes all nine [French] coins
(4) Gives months of the year in correct order
(5) Comprehends and answers easy problem-questions


⁸ Inasmuch as the rate of mental development appears to decrease appre-
ciably after age 10, it becomes difficult to devise tests which will adequately
distinguish between yearly levels. This difficulty was encountered also by the
authors of the first Stanford Revision of the Binet-Simon Scale (1916); but in
the second Stanford revision (1937) the authors were able to provide tests at
yearly levels between ages 10 and 14. See the next chapter.




Age 10 
(1) Arranges five blocks in order of weight 
(2) Reproduces two geometric designs from memory 
(3) Criticizes absurd statements 
(4) Comprehends and answers difficult problem-questions 
(5) Uses three given words in two sentences 


The method of scoring the 1911 scale was modified so that frac- 
tions of a year could be used in determining the mental age. Since 
there were five tests at each age level (except at age 4), each counted 
as two tenths of a year. Thus, if a child passed all tests at age 6, two 
at age 7, and one at age 8, his mental age would be 6.6 years. 
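
The worked example in the paragraph above can be checked with a short Python sketch. The function name and the use of exact fractions are choices made for this illustration; the rule of two tenths of a year per test, with five tests to a level, is taken from the text.

```python
from fractions import Fraction


def mental_age_1911(basal_year, tests_passed_above_basal, tests_per_level=5):
    """Mental age under the 1911 rule.

    Each test passed above the basal year adds 1/tests_per_level of a
    year, that is, two tenths of a year when a level has five tests.
    """
    credit_per_test = Fraction(1, tests_per_level)
    return basal_year + tests_passed_above_basal * credit_per_test


# The child in the text: all tests passed at age 6, two at age 7, one at age 8.
mental_age = mental_age_1911(6, 2 + 1)
print(float(mental_age))  # 6.6
```

Exact fractions are used so that three credits of two tenths each sum to exactly six tenths, without floating-point rounding in the intermediate steps.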

According to Binet, a child whose mental age is equal to his 
chronological age is considered “regular” in intelligence; one whose 
mental age is higher is called “advanced”; and one whose mental 
age is lower is called “retarded.” The degree of advancement or 
retardation in any instance is dependent upon the extent of the dif-
ference. Within about a year, however, William Stern was to suggest
the use of the intelligence quotient, which has since been widely em-
ployed to indicate degree of acceleration or retardation in intelligence.

Binet’s scales of 1908 and 1911 provided the stimulation and the
basis for several adaptations and revisions in the United States. The
authors of American revisions utilized Binet’s principles and drew
freely on his tests, as well as adding new ones and standardizing their
instruments specifically for American children. These revisions will
be presented in the following chapter.

SUMMARY 


At this point we may very briefly summarize Binet’s major
contributions to the theory and practice of intelligence testing.

(1) If a psychologist is to develop a test of intelligence, he should
first formulate a working conception and definition of intelligence,
and then proceed experimentally. As a result of experimentation,
new hypotheses will be developed; these in turn will influence later
test construction; thus, both the conception and measurement of in-
telligence will undergo improvement and refinement. Binet’s own
conception of intelligence included mainly the following character-
istics: ability to reason and judge well, to comprehend well, to take
and maintain a definite direction of thought, to adapt thinking to
the attainment of a desirable end, and to be autocritical.




(2) Intelligence must be measured by testing the higher, complex 
mental processes rather than relatively simple sensory and motor ac- 
tivities. 

(3) Intelligence, being a complex, can be tested only by the use 
of a diversity of materials devised to evaluate the operations of men- 
tal processes as an integrated unit, rather than to measure the sepa- 
rate elements that might contribute to the complex functioning of 
intelligence. Though the Binet tests seem to be simple in concep-
tion and construction, they actually involve many complex mental
activities: memory of several kinds, apperception, free association,
orientation in time, language comprehension, ability with numbers,
knowledge about common objects, constructive imagination, com-
parison of concepts, perception of contradictions, understanding of
abstract terms, ability to meet novel situations, combining fragments
into a meaningful whole.


(4) The tests included must be appropriate to the environment 
of those for whom they are intended. 

(5) The tests were arranged in the form of a scale, from easiest 
to most difficult, and groups of tests were placed at appropriate age 
levels. The criterion was, ideally, that a test should be placed at a 
level where it was passed by three-fourths of that age group. 

(6) The concept of mental age was introduced. 


(7) The tests must be so standardized that the large middle group
of average children (in the curve of distribution) will test “at age.”

(8) Other criteria of validity were introduced: known groups,
scholastic ratings, increase in percentage passing a test at successive
age levels.

(9) The need of establishing the reliability of a test was recog-
nized; Binet, therefore, made a few reliability studies with his 1911
scale.


Binet made not only these contributions; he indicated the very
extensive uses to which psychological tests could be put in educa-
tional, social, vocational, and theoretical problems, for he regarded
tests as tools for research and for scientific solution of important
practical problems. Indeed, many of the researches and uses to which
tests have since been applied are definitely along lines indicated by
Binet, including, among others, the testing of prospective soldiers in
order to eliminate the mentally unfit.

Binet did not regard his tests as final or as quite satisfactory; he did
not claim that they measured all aspects of personality; he empha-
sized that they must be supplemented by psychological and educa-
tional information derived by other means and from other sources.
He did claim—and in this he has been supported by extensive sub-
sequent use—that his test, and improved versions that should follow,
can provide a very useful and reasonably reliable index of an individ-
ual’s general intelligence, when the tests are administered and inter-
preted by qualified examiners.⁹


⁹ Alfred Binet died in 1911. His premature death deprived psychology of one
of its great pioneers.

For a comprehensive study of Binet’s psychology, see E. J. Varon, Develop-
ment of Alfred Binet's Psychology, Psychological Monographs, Vol. 46, No. 3,
1935.




EARLY REVISIONS OF THE BINET- 
SIMON SCALE 


FOUR EARLY REVISIONS 


The two most widely known and used adaptations of the
Binet scale in the United States are the Stanford revisions of 1916
and 1937. There were, however, four other revisions which, at one
time or another, had some currency among psychologists but which
today are infrequently used or are chiefly of historical interest. In
1908, H. H. Goddard published a translation of Binet’s 1905 scale;
and in 1911 he produced, for use in the United States, a revision of
Binet’s 1908 version. Yerkes published revisions in 1915 and 1923,
in which the several types of items were grouped as subtests in a point
scale (e.g., memory span for digits, analogies) instead of being placed
at age-levels. Herring’s revision appeared in 1922 and for some years
was used as a valuable alternate in place of the 1916 Stanford scale.
Kuhlmann’s three revisions, 1912, 1922, and 1939, were extensive
and elaborate in respect to standardization, scoring, and age-range
covered. Thus it is clear that a significant amount of psychological
work had been done prior to the publication of the 1916 Stanford
scale and that several investigators continued their research and im-
provements on Binet’s instrument for some years afterwards.¹


¹ Students interested in these historical aspects should consult the following:
H. H. Goddard, “A Revision of the Binet Scale,” The Training School Bulletin
(Vineland, N. J.), Vol. 8, 1911, pp. 56-62; F. Kuhlmann, A Handbook of
Mental Tests, Baltimore: Warwick and York, 1922, and Tests of Mental Develop-
ment: A Complete Scale for Individual Examination, Minneapolis: Educational
Test Bureau, 1939; R. M. Yerkes et al., A Point Scale for Measuring Mental
Ability, Baltimore: Warwick and York, 1915 and 1923; J. P. Herring, Herring
Revision of the Binet-Simon Tests, Yonkers, N. Y.: World Book, 1923. For
brief descriptions of these revisions see the 1950 edition of this textbook.


THE STANFORD REVISION OF 1916 ²


The full name of this test, The Stanford Revision of the Binet-
Simon Intelligence Scale, is derived from the fact that the revision
was made at Stanford University, under the direction of L. M. Ter-
man. The construction of this scale was undertaken for the purpose
of providing an instrument that would be adequately standardized
and adapted for use in the United States. Its acceptance by psycholo-
gists and educators is attested by the fact that it was the most widely
used individual scale until the revised Stanford-Binet appeared in
1937.

Although Terman and his collaborators examined approximately
2300 subjects—1700 normal children, 200 defective and superior
children, and 400 adults—over a period of several years, the revision
of the scale below the 14-year level was actually based upon the re-
sults obtained with about 1000 native-born children in California.
Each one of these children, representing an unselected group of av-
erage social status, was within two months of his birthday.

The 1916 scale includes 90 test items, covering an age range from
3 years to 14 years, with a group of test items added at the “average-
adult” level and another at the “superior-adult” level. Of these 90
test items, 54 were adapted from the 1911 Binet scale, 5 from earlier
Binet scales, 4 from other American tests, and 27 were new additions.


Validation. The process of selecting the items involved: (1) the
comments and notes of the examiners, including the verbatim re-
sponses of the subjects to each test item; and (2) the percentage of
subjects passing each test at each age level (as an example, see Table
14). “The guiding principle was to secure an arrangement of the tests
and a standard of scoring which would make the median mental age
of the unselected children of each age group coincide with the me-
dian chronological age. That is, a correct scale must cause the average
child of 5 years (CA) to test exactly at 5 (MA), the average child
at 6 to test exactly at 6, etc.” ³ Or, in terms of the intelligence quo-
tient employed with this scale, an unselected group of children at
each age should yield a median of 100.

² L. M. Terman, The Measurement of Intelligence, Boston: Houghton
Mifflin, 1916.

³ Terman, op. cit., p. 53.

Before the desired results were secured and this criterion satisfied,
it was necessary to prepare three revisions of the scale. This involved
the elimination of some test items, the shifting of others up or down
in age level, and changes in scoring standards.⁴ “As finally revised,”
Terman states, “the scale gives a median intelligence quotient closely
approximating 100 for the unselected children of each age from 4 to
14.”
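
The standardization target described above, a median IQ near 100 in each age group, can be illustrated with a small Python sketch. It assumes the ratio form of the intelligence quotient, 100 × MA / CA, which this passage refers to but does not spell out. The function names, the grouping by nearest whole year, and the records (ages in months) are hypothetical and are not Terman's data.

```python
from statistics import median


def iq(mental_age_months, chronological_age_months):
    """Ratio IQ: 100 times mental age divided by chronological age."""
    return 100 * mental_age_months / chronological_age_months


def median_iq_by_age(records):
    """Group (CA, MA) pairs by CA rounded to whole years; return median IQs."""
    groups = {}
    for ca_months, ma_months in records:
        age_years = round(ca_months / 12)
        groups.setdefault(age_years, []).append(iq(ma_months, ca_months))
    return {age: median(values) for age, values in sorted(groups.items())}


# Hypothetical (CA, MA) records in months for two unselected age groups.
sample = [
    (60, 54), (60, 60), (60, 66),   # five-year-olds
    (72, 66), (72, 72), (72, 81),   # six-year-olds
]

for age, value in median_iq_by_age(sample).items():
    print(age, value)
```

In a revision of this kind, items would be shifted or scoring standards changed, and the medians recomputed, until each age group came out close to 100.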

The test items above the age of 14 were based on examinations of 
30 businessmen, 150 migrating unemployed men, 150 adolescent de- 
linquents and 50 high-school students. These groups are not a rep- 
resentative cross-section of persons above fourteen years of age in 
the general population. This fact will help make clear why the 1916 
scale was found to be unsatisfactory for use with older adolescents 
and with adults.⁵ The unsatisfactory quality of the scale at the upper
ages was due also to inadequate sampling of abilities. 


TABLE 14


Percents Passing Tests Located in Year VI:
1916 Revision ⁶


                                     Ages
Test                           4    5    6    7    8
Right and left                40   50   71   86   95
Mutilated pictures            27   50   65   87   96
Counting 13 pennies           30   46   76   93   86
Comprehension                 25   55   70   86   93
Naming four coins             25   47   74   91   95
Repeating 16-18 syllables     34   56   69


⁴ A theoretical or ideal percentage of passes for the placement of a test of
a given age level was not used. Terman states, “We had already become con-
vinced . . . that no satisfactory revision of the Binet scale was possible on any
theoretical considerations as to the percentage of passes which an individual test
ought to show in a given year to be considered standard for that year.” (Op. cit.,
p. 54.) Accordingly, a “trial-and-success” method was used in order to get the
desired median mental age and IQ at each chronological age level. The same
practice was followed in standardizing the 1937 revision.

⁵ In a personal communication Dr. Terman states that there were three ten-
tative versions of the scale before the final one was published. The businessmen
and high-school students were used in making the first tentative placement of
tests at average and superior adult levels. The other adult groups were then used
in subsequent rearrangement of test items.

⁶ From L. M. Terman and others, The Stanford Revision and Extension of
the Binet-Simon Scale for Measuring Intelligence, Educational Psychology
Monographs, No. 18, 1917, pp. 167-168. (By permission.)




In addition to the criterion of a significant increase in the percent 
passing a test item at successive ages, the following criteria were used 
in establishing validity of the 1916 scale. 

First, in each age group, all the subjects tested were divided into
the following three classes: those testing below 90 IQ, those testing
between 90 and 109, and those testing 110 or above. Each test item
was then examined to determine whether it was passed by a “de-
cidedly higher” percentage of individuals in the superior IQ group
than in the inferior. (The term “decidedly higher” was not defined
by Terman.) Only those test-items which satisfied this criterion were
retained. The following data are illustrative.


TABLE 15


Percents Passing Certain Tests
Chronological Age Constant ⁷


                                                 Below   96-   Above
                                                   96    105    105
Age 6   Counting thirteen pennies                  40     77     96
Age 7   Describing pictures                        48     52     80
Age 8   Giving similarities                        44     57     83
Age 9   Making change                              39     60     73
Age 10  Comprehension of problem situations        25     64     76


Second, after the scale had been developed, the IQ’s obtained with
504 school children were compared with their scholastic ratings, as
graded by their teachers, on a five-point scale; namely, very inferior,
inferior, average, superior, very superior. Moderate agreement was
found between intelligence quotients and school ratings, the coeffi-
cient of correlation being .48—close enough so that Terman and his
colleagues concluded there was no justifiable “serious suspicion as
to the accuracy of the intelligence scale.”

Third, the relation between IQ and grade progress was studied for
the children on whom the scale was standardized. A “fairly high”
correlation was found, but there were also some “astounding dis-
agreements,” inasmuch as a given mental age level was found in a
wide range of grades. For example, a mental age of nine was found in
all grades from 1 to 7. Terman states, however, “When the data were
examined, it was found that practically every child whose grade failed

⁷ From Terman and others, ibid., p. 133. (By permission.)




to correspond fairly closely with his mental age was either excep- 
tionally bright or exceptionally dull. Those who tested between 96 and 
105 IQ [the average children] were never seriously misplaced in 
schools.” 8


Reliability. Following its publication, the Stanford-Binet was sub- 
jected to numerous studies in order to determine its reliability by 
the method of self-correlation. The correlation coefficients, which in 
such studies will vary with the size and constitution of the experi- 
mental group, ranged from about .80 to .95. Such coefficients are 
regarded as highly satisfactory indexes of reliability. 


The 1916 Scale. The reader will have noted that no essentially new 
concepts or principles have been added in the 1916 Stanford-Binet 
scale, as compared with Binet’s own. Terman and his colleagues did, 
however, extend, refine, and adapt the Binet scales, so that the 1916 
revision was a better standardized, hence more valid and reliable, 
instrument. 

The complete list of tests of the 1916 Stanford-Binet follows. 9
Throughout the scale, those items designated “Al.,” instead of being
numbered, are alternates, to be used in place of one of the numbered
items, where the examiner, for any reason, believes a numbered item
to be inappropriate.


Age 3 
(1) Points to parts of body 
(2) Names familiar objects 
(3) Enumerates objects in pictures 
(4) Gives sex 
(5) Gives last name 
(6) Repeats six to seven syllables 

(Al.) Repeats three digits 


Age 4 
(1) Compares lengths of lines 
(2) Discriminates between geometric forms 
(3) Counts four pennies 
(4) Copies a square 
(5) Comprehends and solves problem situations 
(6) Repeats four digits 
(Al.) Repeats twelve to thirteen syllables 


8 Terman, The Measurement of Intelligence, p. 74.
9 Reproduced by permission of Houghton Mifflin.


The Stanford Revision of 1916 


Age 5
(1) Compares weights
(2) Names familiar colors
(3) Makes esthetic comparisons of paired drawings of faces
(4) Defines common words: use or better
(5) Puts together a divided triangle
(6) Carries out three commissions
(Al.) Gives own age


Age 6
(1) Knows right from left
(2) Perceives missing parts in pictures
(3) Counts thirteen pennies
(4) Comprehends and solves problem situations
(5) Identifies coins
(6) Repeats sixteen to eighteen syllables
(Al.) Knows morning from afternoon


Age 7
(1) Knows number of fingers on each and both hands
(2) Describes pictures
(3) Repeats five digits
(4) Ties a bow-knot
(5) Gives differences between paired objects
(6) Copies a diamond
(Al. 1) Names days of week in correct order
(Al. 2) Repeats three digits backwards


Age 8
(1) Traces path to be followed in a systematic search for a lost
object in a field
(2) Counts backwards from 20 to 1
(3) Comprehends and solves problem situations
(4) Gives similarities between two things
(5) Defines names of objects in terms superior to use
(6) Defines twenty words from a vocabulary list
(Al. 1) Identifies six coins
(Al. 2) Writes short sentence from dictation


Age 9
(1) Gives date: day of week, month, day of month, year
(2) Discriminates between weights: 3, 6, 9, 12, 15 grams
(3) Makes change in small amounts
(4) Repeats four digits backwards
(5) Makes up a sentence including three given words
(6) Gives rhymes to three words
(Al. 1) Names the months of the year
(Al. 2) Gives total value of a group of one-cent and two-cent postage
stamps


Age 10
(1) Defines thirty words from vocabulary list
(2) Detects absurdities in statements
(3) Reproduces two designs from memory
(4) Reads a short passage and reproduces content
(5) Comprehends and solves problem situations
(6) Names any sixty words by free association
(Al. 1) Repeats six digits
(Al. 2) Repeats twenty to twenty-two syllables
(Al. 3) Fits rectangular blocks into form-board


Age 12
(1) Defines forty words from vocabulary list
(2) Defines abstract words
(3) Traces a path in systematic search (same problem as in year
8, but a superior plan is required here)
(4) Rearranges dissected sentences into meaningful sentences
(5) Interprets fables
(6) Repeats five digits backwards
(7) Interprets pictures
(8) Gives similarities between three things


Age 14
(1) Defines fifty words from vocabulary list
(2) Discovers a rule in a paper-folding test (induction test)
(3) Gives differences between a president and a king
(4) Integrates given facts and arrives at a conclusion concerning
them
(5) Solves arithmetical reasoning problems
(6) Reverses hands of clock, in imagination, and gives the hour
(Al.) Repeats seven digits


Average Adult
(1) Defines sixty-five words from vocabulary list
(2) Interprets fables
(3) Gives differences between abstract words
(4) Solves problem of number of enclosed boxes (boxes within
boxes) when shown only the large outside box
(5) Repeats six digits backwards
(6) Perceives the pattern of a code and uses it
(Al. 1) Repeats 28 syllables
(Al. 2) Comprehends problems involving physical relations


Superior Adult
(1) Defines seventy-five words from vocabulary list
(2) Visualizes, imaginally, and draws appearance of a folded
and cut piece of paper
(3) Repeats eight digits
(4) Repeats thought of a passage heard
(5) Repeats seven digits backwards
(6) Solves problems involving “ingenuity”

The Scoring Method. Each age level from 3 years through 10, it
will be noted, has six test items (plus the alternate which may replace
one of the six). Each of these carries credit of two “months,” so that
the tests in each of the age levels provide a year’s increment in mental
age.

There are no tests at the eleven-year level, the reason being that
the authors of the scale were, apparently, unable to devise tests that
would indicate a one-year difference at this stage of mental development.
This gap in the scale, it is believed, is due to the slowing down
of mental development, thus decreasing the annual increments and
making it more difficult to measure those increments by means of the
then-available test items. Since the eight test items at age twelve cover
a span of two years, each one carries a credit of three months in order
to yield an average mental age of twelve, when added to the ten-year
level.


The same explanation applies to the six test items at age 14, each of
which gives a credit of four months toward mental age score. 10

The tests and credits at the “average adult” level were devised so
as to provide a median value mental age of 16. Yet each of the six
tests at average adult level carries a credit of five months, so that a
person who passed all of them would get a mental age of 16.5 years.
In his volume describing the standardization of the 1916 revision, in
explaining the limit for “average adult,” Terman states that his data
on mental ages of 62 adults, including 30 businessmen and 32 high
school pupils, who were over 16 years of age, show “. . . that the
middle section of the graph [of the distribution] represents the ‘mental
ages’ falling between 15 and 17. This is the range we have designated
as the ‘average adult’ level.” 11
Those persons having mental ages above seventeen were designated
as “superior adults,” the possible maximum mental age on this scale
being 19.5. (Six tests, six months’ credit each, added to the maximum
of 16.5 attainable at the “average adult” level.)

The method of scoring is this: The examiner goes
down in the test until the level is reached where the subject passes
all items. This is called the “basal year.” The examiner then proceeds
upward in the scale until a level is reached where the subject fails
all items. This level is called the “terminal year.” As already stated,
each test item carries specified credit in terms of months, contributing
to the mental age score. These credits are added to the value of
the basal year; the total is the mental age. For example, assume that
in a given instance the basal year is 6; three test items are then passed
at the 7-year level, giving additional credit of 6 months; two are
passed at the 8-year level, giving further credit of 4 months; all are
failed at the 9-year level. Thus, this subject’s mental age is 6 years,
10 months.


10 The 1937 revision provides a group of tests at age 11 and another at 13.
11 Terman, The Measurement of Intelligence, p. 55.


Distribution of IQ’s. A second score obtained with the 1916 Stanford-Binet
is the intelligence quotient, the calculation and general
nature of which have already been explained. Terman not only found
the IQ for each individual examined, but he analyzed the distribution
of intelligence quotients obtained by the persons on whom the scale
was standardized. Taking those subjects between the ages of 5 and 14
years, the distribution was found to be the following:


TABLE 16

Distribution of IQ’s of 905 Unselected Children,
Ages 5-14 Years 12

                  Percent
   IQ             of total

[lowest interval illegible in source]
  76-85             8.6
  86-95            20.1
  96-105           33.9
 106-115           23.1
 116-125            9.0
 126-135            2.3
 136-145            0.55


12 From Terman and others, op. cit., p. 40. (By permission.)


The standard deviation (S.D.) of this distribution is about 12 points.
(Compare this with the S.D. of 16 points of the 1937 revision.)

This distribution, being a fairly symmetrical one, showed that the
scale did differentiate between the several levels of mental capacity
of the persons examined, at least so far as concerns the mental processes
being tested. It therefore strengthened the belief of Terman
and many others that the 1916 Stanford-Binet had considerable
validity.

Another method used to represent the frequency with which different
degrees of intelligence occur was to indicate the percentage of
subjects at and above, or at and below a given IQ, thus:


TABLE 17

Percentage Distribution of IQ’s:
Stanford-Binet Scale, 1916 13

The lowest    1%     go to    70 or below
The lowest    2%     go to    73 or below
The lowest    3%     go to    76 or below
The lowest    5%     go to    78 or below
The lowest   10%     go to    85 or below
The lowest   15%     go to    88 or below
The lowest   20%     go to    91 or below
The lowest   25%     go to    92 or below
The lowest   33⅓%    go to    95 or below

The highest   1%     reach   130 or above
The highest   2%     reach   128 or above
The highest   3%     reach   125 or above
The highest   5%     reach   122 or above
The highest  10%     reach   116 or above
The highest  15%     reach   113 or above
The highest  20%     reach   110 or above
The highest  25%     reach   108 or above
The highest  33⅓%    reach   106 or above


13 From Terman, The Measurement of Intelligence, Boston: Houghton
Mifflin, 1916, p. 78. (By permission.)


Though the foregoing percentages above or below certain IQ
levels are not fixed or identical for all tests (e.g., the distribution for
the 1937 Stanford-Binet is not identical with this), a table such as this
is significant in that it provides one means of defining an individual’s
relative status in respect to the psychological processes being
measured; for, as already explained, the IQ is an index having educational
and clinical connotations.
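
The basal-year method of scoring described in this section lends itself to a short worked sketch. The credits per item (two months at ages 3 through 10, three at age 12, four at age 14, five at average adult, six at superior adult) come from the text; the function names and the dictionary layout are illustrative assumptions only.

```python
# Months of credit carried by each test item at each level of the 1916 scale,
# as stated in the text. The dictionary layout itself is an assumption.
CREDIT_MONTHS = {
    3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2,
    12: 3,
    14: 4,
    "average adult": 5,
    "superior adult": 6,
}


def mental_age_months(basal_year, passed_above_basal):
    """Mental age in months.

    basal_year: highest level (in years) at which all items are passed.
    passed_above_basal: mapping of level -> number of items passed there.
    """
    total = basal_year * 12
    for level, n_passed in passed_above_basal.items():
        total += n_passed * CREDIT_MONTHS[level]
    return total


def as_years_months(months):
    """Split a number of months into (years, months)."""
    return divmod(months, 12)


# The example from the text: basal year 6, three items passed at year 7,
# two at year 8, none at year 9.
example = mental_age_months(6, {7: 3, 8: 2, 9: 0})
print(as_years_months(example))

# Every item passed above a basal year of 10 gives the scale's ceiling.
ceiling = mental_age_months(
    10, {12: 8, 14: 6, "average adult": 6, "superior adult": 6}
)
print(ceiling / 12)
```

The first call reproduces the mental age of 6 years, 10 months worked out above; the second shows where the ceiling of 19.5 years comes from.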

On the basis of the distribution of intelligence quotients obtained 
with the 1916 revision, Terman also suggested the following classi- 
fication: 


TABLE 18

Suggested Classification of IQ’s:
Stanford-Binet Scale, 1916 14

IQ              Classification

Above 140       “Near” genius or genius
120-140         Very superior intelligence
110-119         Superior intelligence
90-109          Normal, or average, intelligence
80-89           Dullness
70-79           Borderline deficiency
Below 70        Definite feeblemindedness


This classification is reproduced because it has been widely used 
and because the reader should be familiar with its source. Unfor- 
tunately, however, such classifications or labels have frequently been 
used uncritically, the erroneous assumption having been that there 
was some quality inherent in the classification itself that warranted the 
designation of “genius,” or “superiority,” or “dullness,” etc. It has 
already been pointed out, for example, that not all persons of very 
high IQ’s have original, creative mentalities; yet these are among the 
traits of the “genius.” The particular IQ intervals and the names at- 
tached to them result from the judgment of the specialist making the 
classification. Some classifiers might choose to place the lower limit 
of “near genius or genius” at 150 IQ, or the upper limit of “feeble- 
mindedness” at 50 or 60. In short, tables of IQ classifications are 
essentially statistical. 

Regardless of size of intervals or of their names, a classification is 
useful chiefly as a convenient device for purposes of research and 
analysis of data. It should not be used merely to label and pigeonhole 
an individual who has been examined and for whom an intelligence 
quotient has been obtained. The trend is away from stating test re- 
sults in terms of MA and IQ alone; the trend is in the direction of 
evaluation of individual performances. 


14 From Terman, op. cit., p. 79. (By permission.)


Adult Mental Age and Adult Intelligence Quotient. Since the 1916 
Stanford-Binet includes tests at the levels of average adult and supe- 
rior adult, it was necessary to make provisions for the calculation of 
adult mental ages and intelligence quotients. These, however, present 
special problems. 

We have already quoted Terman’s reason for locating average adult 
performance in the mental age range of 15 to 17, with the assumed 
midpoint at 16 years. If this is correct, it means that the test performance
of the average adult is equal to that of the average 16-year-old
individual. Otherwise stated, it means that in the case of an average
adult, his maximum level of measured intelligence is reached at the age
of 16 and that there are no increments thereafter. Terman states that
“. . . in so far as it can be measured by tests now available, [intelligence]
appears to improve but little after the age of 15 or 16. . . .
Although this point [at which intelligence attains its final development]
is not exactly known, it will be sufficiently accurate for our purposes
to assume its location at 16 years.” 15 Thus, until the process of decline
sets in, the average adult continues to have a mental age of 16, according
to the 1916 Stanford-Binet.

On the basis of this assumption, then, in the calculation of an IQ
for a person who is 16 years of age, or older, the denominator in the
formula (IQ = MA/CA) is always 16. Otherwise, if his actual CA were
used, he would appear to be getting rapidly less and less intelligent
with the succeeding years. For example, an average individual at the
age of 16 will have an IQ of 100 (16/16). At the age of 18, he should
still have an IQ of 100 even though, according to the tests being used,
there has been no further measurable development of mental capacity;
for the formula will still be 16/16. Now if, in the case of this same
individual, we had continued using his actual CA as the denominator,
his IQ at age 18 would be shown as about 89 (16/18); at age 20, it
would be shown as 80 (16/20), and so on, while as a matter of fact
there would ordinarily have been no such decline. Thus, by using the
denominator of 16 in the IQ formula for all persons above age 16, if a
person of 20 years and one of 60 years have the mental age of 16
years each, then each will be given an IQ of 100.


15 Op. cit., p. 140.


The reader has also noted that it is possible to attain mental ages
above 16 on the 1916 revision, the maximum being 19.5 years at the
level of superior adult. If the definition of mental age is borne in mind,
it will be apparent at once that a mental-age rating which is higher
than that of the average adult has a new and specialized meaning. It
cannot have the same meaning as the term, mental age, does ordinarily.
A mental age is defined as the level of mental development of
the average or typical group of persons at that same chronological age.
Thus, an MA of 10 represents the test performance and mental level
of a group of average children of chronological age 10. Hence, if we
assume that average or typical adults reach a mental age of 16, then
to speak of a “mental age” above 16 is to introduce a new concept; for
these latter “mental ages” are not derived from the performance or
norms of average or typical persons. They are theoretical and hypothetical
indexes devised to enable us to indicate higher than average
mental levels and higher than average intelligence quotients. 16 Thus,
when a higher-than-average adult “mental age” is used, it is essential
that the user be aware of the fact that a new and different concept is
being employed.

The fact that the highest possible “mental age” that can be attained 
on this test is 19.5 years means that the highest IQ an adult can get is 
about 122 (19.5/16). This maximum reveals a serious inadequacy of 
the 1916 revision at the higher levels. What, for example, happens to 
the IQ of the 10-year-old child who has a mental age of 15, and an IQ 
of 150 (15/10)? Obviously, to maintain that IQ of 150 at age 16 or 
older, he must be able to attain a mental age of 24 (24/16); yet the
scale permits a maximum MA of only 19.5, with an IQ of 122. The
same would be true for this subject after age 16.
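
The arithmetic of the adult IQ on the 1916 scale can be checked with a brief sketch. The formula IQ = 100 × MA/CA and the fixed adult divisor of 16 are taken from the text; the function names `iq_1916` and `iq_uncapped` are hypothetical labels chosen for this illustration.

```python
ADULT_CA = 16  # divisor used for everyone aged 16 or older on the 1916 scale


def iq_1916(mental_age, chronological_age):
    """IQ with the chronological age capped at 16, as on the 1916 scale."""
    divisor = min(chronological_age, ADULT_CA)
    return 100 * mental_age / divisor


def iq_uncapped(mental_age, chronological_age):
    """IQ if the actual chronological age were always used (for contrast)."""
    return 100 * mental_age / chronological_age


# An average adult keeps an IQ of 100 at any age under the capped formula.
print(iq_1916(16, 20), iq_1916(16, 60))

# Without the cap the same person would appear to decline.
print(round(iq_uncapped(16, 18)), round(iq_uncapped(16, 20)))

# Ceiling of the scale for adults: a mental age of 19.5 over a divisor of 16.
print(round(iq_1916(19.5, 30)))

# Mental age a 16-year-old would need in order to keep an IQ of 150.
print(150 * ADULT_CA / 100)
```

The last two lines restate the inadequacy noted above: the scale stops at an adult IQ of about 122, whereas an IQ of 150 would require a mental age of 24.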


Criticisms of the 1916 Stanford-Binet. Experience with the 1916 
Stanford-Binet demonstrated that it was inadequate as a measure of 
adult mental capacity. In fact, experience showed that its usefulness 
was restricted to ages between five and fourteen years, the range be- 
tween five and ten years being the most satisfactory. 


16 In their volume describing the 1916 revision (op. cit.) Terman and his 
collaborators do not report how they arrived at their mental-age levels above 
those of average adult. In their 1937 revision this has been done by extrapolation
and by providing for a distribution of adult IQ’s which should correspond
to distributions for pre-adult levels.


This revision was also criticized on several other grounds. First, 
since the scale was finally standardized on the basis of results obtained 
with approximately 1000 native-born children in California, its use 
with all groups of children in all parts of the United States seemed to 
many educators and psychologists to be a practice of doubtful validity; 
for it was held that the 1000 California subjects were not necessarily 
representative of the child population of this country. There is merit in 
the criticism, yet it must be recognized that this scale proved to be 
very useful in many parts of the United States, when employed and 
interpreted by examiners who were familiar with its assumptions and 
construction, and who, at the same time, were familiar with the back- 
grounds of the subjects they were examining. 

Second, the scale was criticized as being much too heavily weighted 
with verbal and abstract materials, thus penalizing the individual who, 
for whatever reason, had been handicapped in developing his “verbal 
intelligence” through the medium of the English language. Terman’s 
reply to this criticism was that intelligence at the verbal and abstract 
levels is the highest form, the sine qua non, of mental ability. Indeed, 
he defined intelligence as the ability to deal with abstract terms, and
to do conceptual thinking.

This criticism of the scale was warranted, nevertheless; for children
who are handicapped by lack of opportunity to acquire and develop
the use of the English language are at a serious disadvantage and get
spuriously low ratings on psychological tests which emphasize “verbal
intelligence.” Such children would include: (1) those who have developed
in homes where only a foreign language is spoken; (2) those
who are handicapped by serious visual or auditory defects; (3) those
handicapped by sensory anomalies (reversals, inversions, mirror-writing,
poor sound discrimination) which seriously interfere with
their learning to read; (4) those who are too young, say, below age
four or five, to be tested adequately by means of verbal materials almost
exclusively.

Third, the 1916 scale was found to be defective at some points in
respect to procedures in administering and scoring, thus detracting
from its objectivity and from the comparability of results obtained by
different examiners.

In view of these criticisms, it was to be expected that other scales
should be developed, especially those of the “performance” and “nonverbal”
type, which would obviate or minimize the second criticism.
These will be presented and discussed in a later chapter.

It was to be expected, also, that the 1916 Stanford-Binet itself
should undergo revision in the light of experience, criticism, and accumulated
data. Such a revision, begun about ten years after the original
Stanford-Binet appeared, was published in 1937.


6. 




THE 1937 REVISION 
OF THE STANFORD-BINET SCALE 


DESCRIPTION OF THE 1937 SCALE

This scale differs from that of 1916 in many details, but it does
not differ in its essential and basic conceptions. 1 As the authors themselves
state, “The revision utilizes the assumptions, methods, and
principles of the age scale as conceived by Binet.” The authors, however,
do regard it as a better-standardized and more useful scale than
its predecessors. The principal differences and modifications follow.

The 1937 scale has two equivalent forms (L and M), each of which
contains 129 test items, as compared with the 90 items in the first
Stanford-Binet. Items that proved unsatisfactory in the original were
eliminated, and new ones were added.

The 1937 scale extends downward to the level of age 2, and upward
through three levels of “superior adult” (known as Superior Adult I,
II, and III), thus increasing its usefulness.

The levels below age 5 and those above age 14 have been more
carefully and validly standardized.

Scoring standards and instructions for administering the tests were
improved.


1 L. M. Terman and M. A. Merrill, Measuring Intelligence, Boston: Houghton
Mifflin, 1937. See also R. Pintner and others, Supplementary Guide for the
Revised Stanford-Binet Scale (Form L), Applied Psychology Monographs,
No. 3, 1944. This monograph presents, for purposes of comparison, a collection
of responses to test items but does not duplicate those provided in the
Terman-Merrill manual.


130 The 1937 Revision of the Stanford-Binet Scale 


From the age of 2 to age 5, this scale provides groups of test items 
at half-year intervals. Thus more accurate and more highly differen- 
tiating test results are obtainable. The half-yearly intervals are possible 
because the rate of mental growth is most rapid in the earlier years 
and, therefore, the more rapid periodic increments are susceptible to 
testing. 

Groups of tests are provided at ages 11 and 13, whereas there were 
none at these levels in the 1916 scale for reasons already stated. 

Although the 1937 scale, like that of 1916, is predominantly verbal 
in character, it does provide more performance and other nonverbal 
materials at the earlier age levels, especially through the age of four 
years. Performance materials are those with which the subject has to 
do something; for example, building a pattern or making a design with 
blocks, or filling in a form board with the variously shaped blocks. 
Other nonverbal materials include such activities as copying a geo- 
metric figure, completing the picture of a man, discriminating between 
forms, etc. In all of these, of course, verbal ability is a factor to the 

extent that verbal directions must be understood. In these tests, verbal 
ability can also operate as a factor if the subject is familiar with the 
names of the objects or geometric figures and is thus facilitated in 
his manipulation or classification of them. 

The 1937 scale has been standardized on a more carefully chosen
and much more extensive group of subjects. The base of the standardization
population was broadened, and its component members are
regarded as more representative of the population. 2 But only American-born
white subjects were used in the standardization of this scale,
the total number being approximately 3000. The subjects were chosen
from eleven states in several widely separated areas of the country,
and an effort was made to have the subjects from homes which, occupationally
and socially, would be representative of the population
at large in the United States.


VALIDATION


The test items were chosen on the basis of their validity, ease
and objectivity of scoring, economy of time in administering, interest
to the subjects, and need for variation in types of materials.


2 Up to the age of 5, the number of subjects used was 100 to each half-year
level; from 6 to 14 years of age, 200 at each year; from 15 to 18 years, 100 at
each year.


Of the foregoing, validity is of primary significance. In this revision, 
a criterion of basic importance in judging validity of test items was in- 
crease in percentage of successful performance with increasing age. 
This criterion was applied in two ways: first, by requiring an appreci- 
able increase in the percent passing a given item in successive ages 
(as in the 1916 scale); and second, by finding “a weight based on the 
ratio of the difference to the standard error of the difference between 
the mean age (or mental age) of subjects passing the test and of subjects
failing it.” 3 Stripped of its statistical terminology, what this quotation
means is that the difference between the average age (chronological
or mental) of subjects passing an item, on the one hand, and
the average age of subjects failing that item, on the other hand, must
be statistically significant. This is essentially an “age criterion.” In this
connection see Table 19.


TABLE 19

Percent Passing Test Items Located in Year VI, 4
Form L

                          Ages

Item    4    4½    5    5½    6    7    8
w 3 15 3% 30 6 89 7 
2. u 9 # 5 7 $6 95 
x mio 26 46 iH BGG 
x > 1 v8 aw 
5. 6 2 «+ 1 p A 95 
6. % H# 5 ol 81 9 93 
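
The “age criterion” described above, namely the ratio of a difference between two mean ages to the standard error of that difference, can be illustrated with a small sketch. The ages below are invented solely for illustration, and the particular standard-error formula (unpooled, using sample variances) is an assumption; the source does not state which formula Terman and Merrill used.

```python
from math import sqrt
from statistics import mean, variance


def critical_ratio(ages_passing, ages_failing):
    """Difference between mean ages divided by its standard error.

    Uses the unpooled standard error of a difference between two
    independent means; this choice is an assumption of the sketch.
    """
    diff = mean(ages_passing) - mean(ages_failing)
    se_diff = sqrt(
        variance(ages_passing) / len(ages_passing)
        + variance(ages_failing) / len(ages_failing)
    )
    return diff / se_diff


# Hypothetical ages (in years) of children passing and failing one item.
ages_passing = [6.5, 7.0, 7.5, 8.0, 6.0, 7.0]
ages_failing = [4.5, 5.0, 5.5, 6.0, 5.0, 4.0]

cr = critical_ratio(ages_passing, ages_failing)
print(round(cr, 2))
```

A large ratio indicates that those passing the item are reliably older than those failing it, which is what the criterion requires of a well-placed item.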


A second criterion of major importance in the retention of an item
was its correlation with the total scores of the individuals of the age
level at which the test item is located. Table 20 presents the distribution
of correlation coefficients (biserial) for both forms L and M.

The calculated median of the coefficients for form L is approximately
.69; the middle 50 percent of the coefficients fall between approximately
.51 and .73. The range for the whole set of coefficients is
from .28 (memory for designs, year 11) to .89 (abstract words, year
11; and vocabulary, year 14).


3 Terman and Merrill, op. cit., p. 9. See also V. V. Fleming, “A Study of
the Subtests in the Revised Stanford-Binet Scales, Forms L and M,” Journal of
Genetic Psychology, vol. 64, 1944, pp. 3-36.
4 From Q. McNemar, The Revision of the Stanford-Binet Scale, Boston: Houghton
Mifflin, 1942, p. 92. (By permission.)
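
For readers unfamiliar with the biserial coefficient named above, the following sketch computes one from pass-fail results on a single item and the corresponding total scores. It uses the usual textbook formula (difference between the mean total scores of those passing and failing, divided by the standard deviation of all scores, times pq/y, where y is the ordinate of the normal curve at the point of division). The data are invented, the function name is hypothetical, and nothing here is taken from McNemar's computations.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(passed, total_scores):
    """Biserial correlation between a pass/fail item and total score.

    passed: sequence of booleans (True = item passed).
    total_scores: sequence of total scores, in the same order.
    Requires at least one pass and one failure.
    """
    pass_scores = [s for ok, s in zip(passed, total_scores) if ok]
    fail_scores = [s for ok, s in zip(passed, total_scores) if not ok]
    p = len(pass_scores) / len(total_scores)  # proportion passing
    q = 1 - p                                 # proportion failing
    z = NormalDist().inv_cdf(p)               # point dividing the normal curve
    y = NormalDist().pdf(z)                   # ordinate at that point
    spread = pstdev(total_scores)             # S.D. of all total scores
    return (mean(pass_scores) - mean(fail_scores)) / spread * (p * q / y)


# Hypothetical data: eight subjects, half of whom pass the item.
passed = [True, True, True, True, False, False, False, False]
scores = [110, 95, 105, 120, 100, 90, 115, 85]

r = biserial_r(passed, scores)
print(round(r, 3))
```

Because the biserial coefficient assumes a normally distributed trait underlying the pass-fail division, it can exceed 1.00 with markedly non-normal data; the invented scores here were chosen to avoid that.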


On form M, the calculated median coefficient is approximately .64;
the middle 50 percent of the coefficients fall between approximately
.5[?] and .71. The range for this whole set of coefficients is from
.27 (memory for stories, year 13) to .91 (abstract words, year 13).

Of the 258 coefficients, 207, or very nearly 80 percent, are .50 or
higher. This fact and the data presented in a later section in this chapter
(Analysis of Functions Tested) provide strong evidence that the
Stanford-Binet scale measures “general ability” by means of test items
that have psychological functions in common to a high degree.


TABLE 20

Distribution of Correlation Coefficients (Biserial)
for Each Item with Total Scores 5

           Form L        Form M
  r        Frequency     Frequency

[entries illegible in source]

  N          129           129


After selection of the tests that were to be used, one other empirical
procedure was employed in locating each test item at an appropriate
age level. The items were rearranged until it was found that they
would yield for each group of subjects a mean mental age that was
identical with its mean chronological age, thus giving a mean IQ as close
as possible to 100. “Six successive revisions were necessary to achieve
this for form L. Then . . . it was possible to achieve at once an
equally good result with form M by arranging its tests so as to match
those of form L at each age level with respect to difficulty, validity,
and shape of curves of percents passing by age.” 6


RELIABILITY


Comparing IQ’s obtained with forms L and M, Terman and
Merrill report reliability coefficients of correlation from .90 to .98.
The highest coefficients were found for IQ’s below 70 (.98), the lowest
for IQ’s above 130 (.90), and intermediate coefficients for IQ’s
near 100 (.92). The levels above six years showed greater reliability
(.93) than those below six (.88), calculated for separate age-groups.
These coefficients are of the same general order as those found by
others in subsequent studies. (See figure.)


[Figure: scatter diagram of the correlation between Form L and Form M
IQ’s. From Terman and Merrill, op. cit. (By permission.)]


5 Based upon data in Q. McNemar, op. cit., Tables 33 and 34.
6 Terman and Merrill, op. cit., p. 23.


DETERMINING MENTAL AGE AND INTELLIGENCE
QUOTIENT

The method is exactly the same in principle as that used with the
1916 scale for the determination of mental age and intelligence
quotient. There are, however, a few differences in the details.


Whereas the maximum mental age attainable on the 1916 Stanford- 
Binet was 19 years and 6 months, the maximum on the 1937 revision 
is 22 years and 10 months. It will be recalled that with the first Stan- 
ford-Binet scale, a maximum CA of 16 was used in the denominator 
to determine the IQ of an individual of 16 years of age or older. In 
the 1937 scale, the maximum CA in the denominator is 15. Thus, the 
highest possible IQ attainable by a subject who is 15 years of age or 
older is 152 (22⅚/15).

In the superior-adult levels, the test items were selected and their 
credits allotted (in terms of months) in such a manner as to make the 
IQ distribution of “. . . the older subjects resemble closely those 
of the younger, as presumably should be the case on an ideal scale.” 7 


TABLE 21

Correction of CA Divisor: 1937 Stanford-Binet Scale 8

                  Corrected
Actual CA         CA Divisor

  13-0              13-0
  13-3              13-2
  14-0              13-8
  14-6              14-0
  15-0              14-4
  15-6              14-8
  16-0              15-0


In order to achieve this desired goal, it was necessary for the authors 
to make adjustments in the denominator of the IQ formula, beginning 
at the age of 13 years and 2 months. The reason given for this adjustment
is that it is extremely difficult, perhaps impossible, to escape the
effects of selection of subjects at the upper ages in standardizing a 
scale. The selection generally is such as to include the average range 
and the higher mental levels but not the lowest, since the less intelli- 
gent individuals tend to leave school earlier than the others. Hence, 
the norms of test-performance of the older groups, it is argued, tend to 
be higher than they should be for an unselected sampling. These 
higher norms, in turn, tend to reduce the intelligence quotients of 
subjects in the older groups. It is this effect that the authors of the 
scale sought to correct by means of adjusting the denominator.


7 Ibid., p. 30.
8 From Terman and Merrill, op. cit., p. 31. (By permission.)


While Terman and Merrill believe they minimized the effects of 
selection, these effects were not wholly eliminated. Therefore, after a 
“trial-and-success” process directed toward making IQ distributions of 
older age groups resemble closely those of younger groups, the procedure 
adopted was this: from age 13 to age 16, the cumulative dropping of 
one out of every three additional months of chronological age, and all 
of it after 16. In substance, this practice is tantamount to saying that 


TABLE 22 
IQ Means Adjusted for 1930 Census Frequencies 
of Occupational Groupings 9 
Composite of Forms L and M, Stanford-Binet 

Age       N       Raw      Smoothed 
2         76     102.1 
2½        74     104.7      103.3 
3         81     103.2      104.1 
3½        77     104.3      102.2 
4         83      99.2      101.6 
4½         …     101.2      100.8 
5         90     101.9      100.4 
5½       110      98.2      100.0 
6        203     100.0       99.8 
7        202     101.2      100.8 
8        203     101.1      102.0 
9        204     103.6      102.7 
10       201     103.5      103.0 
11       204     101.9      102.2 
12       202     101.2      101.6 
13       204     101.8      101.0 
14       202     100.0      101.3 
15       107     102.0      101.3 
16       102     101.8      103.3 
17       109     103.2      103.8 
18       101     106.3 


average adult mental age, on the 1937 scale, is 15. Table 21 gives a 
few examples. 

When, therefore, this scale is used with subjects who are older 
than thirteen years, it is necessary to refer to the full correction table 
provided in the manual. Or the examiner may use the tables of IQ’s 
provided in the manual, in which the necessary adjustments have al- 
ready been made. 


9 From Terman and Merrill, op. cit., p. 36. (By permission.) 
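The divisor rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, ages are expressed in months, and months that do not complete a group of three are assumed not to be dropped. The full correction table in the manual remains the authority for intermediate ages.

```python
def corrected_ca_months(ca_months: int) -> int:
    """CA divisor in months under the rule sketched in the text.

    Assumption: between 13-0 and 16-0, one month is dropped for every
    completed group of three months beyond 13-0; nothing is added
    after 16-0, so the divisor never exceeds 15-0.
    """
    start, cap = 13 * 12, 16 * 12
    if ca_months <= start:
        return ca_months
    extra = min(ca_months, cap) - start
    return start + extra - extra // 3


def ratio_iq(ma_months: int, ca_months: int) -> int:
    """IQ = 100 * MA / corrected CA, rounded to the nearest whole number."""
    return round(100 * ma_months / corrected_ca_months(ca_months))


def years_months(months: int) -> str:
    """Format a month count in the 'years-months' style used in Table 21."""
    return f"{months // 12}-{months % 12}"


# Reproduce the rows of Table 21.
for ca in (156, 159, 168, 174, 180, 186, 192):
    print(years_months(ca), "->", years_months(corrected_ca_months(ca)))

# Highest attainable IQ: maximum MA of 22-10 (274 months), adult subject.
print(ratio_iq(274, 20 * 12))
```

With these assumptions the loop reproduces the seven rows of Table 21, and the last line gives the maximum IQ of 152 cited earlier.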


DISTRIBUTION OF IQ’S 


The mean IQ’s, for the subjects used in the standardization, 
are slightly above 100. But this, the authors say, is due to an “inten- 
tional adjustment to allow for the somewhat inadequate sampling of 
subjects in the lower occupational classes.” The adjustment was made 
by dividing the subjects into seven groups, according to the occupa- 
tion of the fathers; at each age level the mean IQ was computed sepa- 
rately for each of the seven groups. These means at each age level 
were given a weight according to the occupational frequencies of 
each group, as shown by the 1930 census. The weighted means were 
then combined into a composite mean for each age level from 2 to 18 
years, as shown in Table 22. The same data are represented graphi- 
cally in Figure 6.2. 


FIG. 6.2. Distribution of Composite L-M IQ’s of Standardization Group. 
(Horizontal axis: I.Q., in ten-point intervals from 35-44 to 165-174.) 
Terman and Merrill, Measuring Intelligence, Boston: Houghton Mifflin, 
p. 37. (By permission.) 


In determining the equality and comparability of IQ’s from age 
to age, it is necessary not only that the means be very much the same 
(ideally, identical), but that the variations be the same at all age 
levels. If the differences between the variations of the age groups are 
large, then the same numerical IQ will have different significance at 
different chronological ages. 
Consider the following hypothetical instance. Suppose a given test 
of mental ability yields the following results: 


Chronological Age      Mean IQ      Standard Deviation 
       10                100               14 
       11                100               20 


Accordingly, a 10-year-old child having an IQ of 86 (that is, one 
standard deviation below the mean) would have a percentile rating 
of approximately 16, which, it will be recalled, means that this child 
surpasses about 16 percent of his age group. Now, according to the 
foregoing data, a child of 11 years whose IQ is 80 (likewise one 
standard deviation below the mean of his group) would also have a 
percentile rating of about 16, in spite of the fact that his intelligence 
quotient is six points below that of the 10-year-old in question. While 
this difference of six points may make little practical difference in the 
clinical and educational treatment of these children, it is necessary to 
be familiar with the implications of differences in variations. 

Another aspect of the problem is this. Taking our hypothetical 
case of the 10-year-old boy, if he is to have at age 11 the same relative 
rank he held at age 10, his IQ will have to drop from 86 to 80; and 
if this happens, the change will give the appearance of lack of IQ 
constancy; but, if he maintains his 86 IQ at age 11, then his percentile 
rank will be approximately 24. Here, again, the difference in percen- 
tile ranks might have little practical significance; but insight into the 
effects of differences in group variations will help explain some in- 
stances of lack of close agreement in repeated mental measurements. 
Furthermore, the professional worker who uses mental tests must 
know the standard deviations of the instruments he employs in order 
to make as accurate an evaluation of his results as possible. 

IQ variability in relation to age, as found in the standardization of 
the 1937 Stanford-Binet, is shown in Table 23. Inspection of this table 
shows significant fluctuation in standard deviations of the several age 
groups, especially between the extremes: namely, 12.5 (S.D.) at age 
6, 20.6 (S.D.) at age 2½, and 20.0 (S.D.) at age 12. It will be noted, 
too, that the standard deviations fluctuate around 16 and 17 as a 
median value. The standard deviation of the composite IQ’s (Forms 
L and M) is 16.4 for the entire standardization group of subjects. 
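The percentile figures in the hypothetical case of the 10- and 11-year-old can be checked with a short calculation. The sketch below assumes that IQ's are normally distributed within each age group; the function names are illustrative only.

```python
from statistics import NormalDist


def percentile_rank(iq: float, mean: float, sd: float) -> float:
    """Percent of the age group falling below `iq`, assuming normality."""
    return 100 * NormalDist(mu=mean, sigma=sd).cdf(iq)


def iq_at_percentile(pct: float, mean: float, sd: float) -> float:
    """IQ that corresponds to a given percentile rank, assuming normality."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(pct / 100)


# Age 10 (S.D. 14): an IQ of 86 is one S.D. below the mean.
rank_at_10 = percentile_rank(86, mean=100, sd=14)
print(round(rank_at_10))                              # 16

# Age 11 (S.D. 20): an IQ of 80 holds the same relative rank.
print(round(percentile_rank(80, mean=100, sd=20)))    # 16
print(round(iq_at_percentile(rank_at_10, 100, 20)))   # 80

# Age 11 (S.D. 20): an unchanged IQ of 86 means a higher rank.
print(round(percentile_rank(86, mean=100, sd=20)))    # 24
```

The same numerical IQ thus carries a different relative standing whenever the standard deviations of the two age groups differ.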


In respect to the fluctuations in IQ variability, the authors state: 
“Notwithstanding our strenuous efforts to correct for . . . errors of 
sampling, complete success is hardly to be expected, and a consider- 
able degree of irregular fluctuation in the found magnitudes of IQ 
variability from age to age could reasonably be attributed to these 
sources of error. . . . Since inspection of the values reveals no 


TABLE 23 
IQ Variability in Relation to Age 10 
Stanford-Binet 

                   S.D.        S.D. 
CA        N       Form L      Form M 
2        102       16.7        15.5 
2½       102       20.6        20.7 
3         99       19.0        18.7 
3½       103       17.3        16.3 
4        105       16.9        15.6 
4½       101       16.2        15.3 
5        109       14.2        14.1 
5½       110       14.3        14.0 
6        203       12.5        13.2 
7        202       16.2        15.6 
8        203       15.8        15.5 
9        204       16.4        16.7 
10       201       16.5        15.9 
11       204       18.0        17.3 
12       202       20.0        19.5 
13       204       17.9        17.8 
14       202       16.1        16.7 
15       107       19.0        19.3 
16       102       16.5        17.4 
17       109       14.5        14.3 
18       101       17.2        16.6 


marked relationship between IQ variability and CA over the age range 
as a whole, we may accept 16 points as approximately the representa- 
tive value of the standard deviation of IQ’s for an unselected popula- 
tion.” 11 As evidence to justify this position, the authors of the scale 
present the graph shown in Figure 6.3. These distribution curves of 
composite (L and M) intelligence quotients indicate that their var- 
iability is approximately the same for the three age-level groupings. 12 


10 From Terman and Merrill, op. cit., p. 40. (By permission.) 

11 Ibid., pp. 39-40. See also M. Aborn and G. F. Derner, “IQ Variability in 
Relation to Age on the Revised Stanford-Binet,” Journal of Consulting Psychol- 
ogy, Vol. 15, 1951, pp. 231-235. 
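The fluctuation described for Table 23 can be summarized directly from the Form L column. In the sketch below the values are those read from the table, on the assumption that entries whose leading digit was lost in reproduction (at ages 3, 4, 4½, and 11) are 19.0, 16.9, 16.2, and 18.0; the function name is illustrative only.

```python
from statistics import median

# Form L standard deviations by chronological age, as read from Table 23.
# Assumption: the entries at ages 3, 4, 4.5, and 11 have had a dropped
# leading "1" restored (19.0, 16.9, 16.2, 18.0).
FORM_L_SD = {
    2.0: 16.7, 2.5: 20.6, 3.0: 19.0, 3.5: 17.3, 4.0: 16.9, 4.5: 16.2,
    5.0: 14.2, 5.5: 14.3, 6.0: 12.5, 7.0: 16.2, 8.0: 15.8, 9.0: 16.4,
    10.0: 16.5, 11.0: 18.0, 12.0: 20.0, 13.0: 17.9, 14.0: 16.1,
    15.0: 19.0, 16.0: 16.5, 17.0: 14.5, 18.0: 17.2,
}


def summarize(sd_by_age: dict) -> dict:
    """Return the extreme ages and the median standard deviation."""
    lowest_age = min(sd_by_age, key=sd_by_age.get)
    highest_age = max(sd_by_age, key=sd_by_age.get)
    return {
        "lowest": (lowest_age, sd_by_age[lowest_age]),
        "highest": (highest_age, sd_by_age[highest_age]),
        "median": median(sd_by_age.values()),
    }


print(summarize(FORM_L_SD))
```

On these values the smallest standard deviation falls at age 6 and the largest at age 2½, with a median of 16.5, which agrees with the authors' choice of 16 points as a representative value.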

Proceeding on the basis of the foregoing reasoning, that the stand- 
ard deviation of IQ’s is 16 points and that IQ values are comparable 
at all age levels, Terman and Merrill provide a table giving intelli- 
gence-quotient equivalents of standard scores (Table 24). This table, 
if Terman and Merrill’s assumption of a “normal distribution” is ac- 
cepted, is useful in giving more precise expression to the relative 
significance of any given intelligence quotient; for by means of an 
ordinary table of standard deviation values, it is possible to determine 
the approximate percentile status of any individual in the distribution 
of intelligence quotients. 


FIG. 6.3. Distribution of Composite L-M IQ’s at Three Age Levels 
(ages 2-5½, N = 662; ages 6-12, N = 1419; ages 13-18, N = 823; 
vertical axis, per cent; horizontal axis, I.Q.). 
Terman and Merrill, Measuring Intelligence, Boston: Houghton Mifflin, 
p. 41. (By permission.) 


12 Some critics have not accepted the argument and conclusions of Terman 
and Merrill. It is our purpose, at this stage, only to present the scale and the 
rationale upon which it was constructed by its authors. 
TABLE 24 
IQ Equivalents of Standard Scores 13 
Stanford-Binet 

Standard Score     IQ       Standard Score     IQ 
    +5.00         180           -0.25          96 
    +4.75         176           -0.50          92 
    +4.50         172           -0.75          88 
    +4.25         168           -1.00          84 
    +4.00         164           -1.25          80 
    +3.75         160           -1.50          76 
    +3.50         156           -1.75          72 
    +3.25         152           -2.00          68 
    +3.00         148           -2.25          64 
    +2.75         144           -2.50          60 
    +2.50         140           -2.75          56 
    +2.25         136           -3.00          52 
    +2.00         132           -3.25          48 
    +1.75         128           -3.50          44 
    +1.50         124           -3.75          40 
    +1.25         120           -4.00          36 
    +1.00         116           -4.25          32 
    +0.75         112           -4.50          28 
    +0.50         108           -4.75          24 
    +0.25         104           -5.00          20 
     0.00         100 
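Every entry in Table 24 follows from a mean of 100 and a standard deviation of 16. The sketch below shows that relation and, on the assumption of a normal distribution that the table itself requires, the conversion of an IQ to an approximate percentile; the function names are illustrative only.

```python
from statistics import NormalDist

MEAN_IQ = 100
SD_IQ = 16  # representative standard deviation adopted by Terman and Merrill


def iq_from_standard_score(z: float) -> float:
    """IQ equivalent of a standard score: 100 + 16z."""
    return MEAN_IQ + SD_IQ * z


def percentile_from_iq(iq: float) -> float:
    """Approximate percentile status of an IQ, assuming normality."""
    return 100 * NormalDist(mu=MEAN_IQ, sigma=SD_IQ).cdf(iq)


# A few rows of Table 24.
for z in (5.00, 3.25, 1.00, 0.00, -0.25, -1.75, -5.00):
    print(f"{z:+.2f} -> {iq_from_standard_score(z):.0f}")

# Percentile status of an IQ one standard deviation above the mean.
print(round(percentile_from_iq(116)))  # 84
```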


TABLE 25 
Distribution and Classification of Composite L-M IQ’s 
of the Standardization Group 

IQ           N      Percent     Classification 
160-169       1      0.03       Very superior 
150-159       6       …         Very superior 
140-149      32      1.1        Very superior 
130-139      89      3.1        Superior 
120-129     239       …         Superior 
110-119     524       …         High average 
100-109     685     23.5        Normal or average 
90-99       667     23.0        Normal or average 
80-89       422       …         Low average 
70-79       164       …         Borderline defective 
60-69        57      2.0        Mentally defective 
50-59         …       …         Mentally defective 
40-49         …       …         Mentally defective 
30-39         1      0.03       Mentally defective 


13 From Terman and Merrill, op. cit., p. 42. (By permission.) 


SUGGESTED CLASSIFICATION OF REVISED STANFORD-BINET IQ’S 

The classification in Table 25 has been provided by one of 
the authors of the 1937 revision. 14 It will be noted that the nomen- 
clature and the percents in each of the several categories differ somewhat 
from those of the 1916 instrument. 

Like all such tables, its purpose is primarily descriptive and, also, 
to serve as an aid in the ordering and analysis of testing results. The 
table is valuable, as well, in showing an approximate distribution of 
intelligence quotients throughout most of the range of mental ability. 


ANALYSIS OF FUNCTIONS TESTED 
The items of both forms L and M have been analyzed by 
factorial methods. McNemar, who made the first and major analysis, 
concluded that at each of the several age levels the items test (“are 
saturated with”) a common factor (g), and that this common factor 
is the same one at all age levels (hence it may be called g). The 
weight of the common factor differs somewhat among the various age 
levels; but the common factor accounts, on the average, for about 40 
percent of the differences (variance) in scores, hence for about 40 
percent of the differences in performance among a group of testees. 
The statistical results also suggest the presence of group factors at 
the following ages: 2, 2½, 6, 18, and possibly 7 and 11. These are 
second factors (group factors) that account for from 5 to 11 percent 
of the differences; while a third factor (another group factor) con- 
tributes from 4 to 7 percent. The group factors do not appear to be 
identical at all age levels; nor are they at all well defined as regards the 
psychological processes involved in them. Tentatively, however, Mc- 
Nemar suggests that several of these group factors, at different levels, 
might be called “memory for designs,” “motor,” “verbal.” The most 
definite and significant conclusion, however, is that one factor (g) 
is sufficient to account for the intercorrelations of test items, with the 
few exceptions noted. 

When complex and comprehensive (“molar” or “global”) types 
of test items are used in a scale, as in the Stanford-Binet (and quite 
appropriately so), it is not surprising that attempts to isolate group 
factors through statistical analysis yield results that are indefinite and 
uncertain, and at best tentative. The reason is that these types of items, 
being complex, involve a number of psychological processes, organ- 
ized in varying degrees and interacting. Group and specific factors will 
be found most clearly when the test items employed are fractionations 
and small segments of a whole pattern of mental functioning. But such 
fractionation can destroy “the whole” and can fail to reveal the kinds 
of mental operations with which the examining psychologist is often 
most concerned. 


14 From M. A. Merrill, “The Significance of IQ’s on the Revised Stanford- 
Binet Scales,” Journal of Educational Psychology, Vol. 29, 1938, pp. 641-651. 

The identification of a general factor in a revision of the Binet scale 
should not occasion any surprise, for it will be recalled that Binet 
himself set out to develop an instrument that should test an individ- 
ual’s general intelligence by means of sampling a variety of mental 
activities which are manifestations of such intelligence. It appears, 
therefore, that contemporary statistical analyses, applied to the age 
scale, are confirming Binet’s psychological insights. 

It was found, also (as Spearman had shown in his earlier analyses), 
that the various kinds of test items differed as regards the extent 
to which they tested (are loaded with) the general factor. The follow- 
ing listing shows which items were found to have high loadings of 
the general factor, and which had low loadings. 15 


AGES 2 TO 4½ 

High Loadings 
Picture vocabulary 
Identifying objects by name 
Response to pictures 
Comparison: balls and sticks 
Comprehension 
Opposite analogies 
Pictorial identification 
Naming materials [used in making various objects] 

Low Loadings 
Block building: tower 
Block building: bridge 
Three-hole form board: rotated 
Motor coordination 
Copying a circle 
Drawing a cross 
Three commissions 
Stringing beads 


AGES 5 TO 11 

High Loadings 
Pictorial likenesses and differences 
Similarities: two things 
Vocabulary 
Verbal absurdities 
Similarities and differences 
Naming the days of the week 
Dissected sentences 
Abstract words [definitions] 
Picture absurdities 

Low Loadings 
Paper folding: triangle 
Patience: fitting rectangles 
Copying a bead chain 
Copying a bead chain from memory 
Word naming [free association] 
Word naming: animals 
Block counting 


AGES 12 TO SUPERIOR ADULT III 

High Loadings 
Vocabulary 
Verbal absurdities 
Abstract words [definitions] 
Differences between abstract words 
Arithmetical reasoning 
Proverbs 
Essential differences 
Sentence building 

Low Loadings 
Problems of fact 
Copying a bead chain from memory 
Memory for stories 
Enclosed box problem 
Paper cutting [visual imagery] 
Plan of search 
Repeating digits [forward] 
Repeating digits: reversed 


15 From McNemar, op. cit., pp. 111-113. 


Examination of the items that have low loadings reveals that they 
test only a very limited range of functioning and that, with two ex- 
ceptions, they involve only the following processes: visualization 
(space perception and spatial relationships), visual imagery, and rote 
memory (immediate recall). All of the test items, among the low 
loadings, falling under the foregoing categories are lacking, relatively, 
in complexity and would, therefore, not have the differentiating power 
of the more complex tasks required by the items under high loadings. 
The exceptions are: word naming (random and animals), and prob- 
lems of fact. The latter is properly considered to be a test of reason- 
ing. Its low loading with g is therefore surprising, but no explanation 
is apparent or available. Random word naming tests richness of free 
association; word naming of animals tests controlled association. A 
possible explanation of their low loadings might be that they are 
fairly routine tasks that do not require the reasoning (organization, 
analysis) demanded by the items having high loadings within the 
same range of ages (5-11). 

The foregoing listings are significant for at least two additional and 
important reasons. First, a knowledge of test items that have high or 
low loadings of the general factor enables the examining psychologist 
to make a more thorough analytical and meaningful evaluation of an 
individual's over-all test performance. The examiner is thus in a better 
position to evaluate the strength of the general factor in a particular 
testee. This is particularly valuable if the psychological nature of the 
general factor has been determined. Second, inspection of the lists of 
items having high loading strongly indicates that the general, or com- 
mon, factor is one that involves acquisition of, use of and reasoning 
with symbols—namely, language and number—even though the test- 
ing of these begins at a very elementary level and at times utilizes 
nonverbal materials in presenting the problem (e.g., pictures, sticks). 




The mental activities required by these test items have very much 
in common with Spearman’s view that intelligence is essentially the 
ability to educe relations and correlates. 

Specifically, the following processes are involved in the test items 
having high loadings: acquisition and use of vocabulary; verbal analy- 
sis of a situation; verbal and numerical concept formation; insights 
into similarities and differences (also involving concept formation); 
analysis and synthesis of materials, both nonverbal and verbal; or- 
ganization and reorganization of materials, both nonverbal and verbal. 

The list of items given above (based on statistical calculations) and 
the indicated psychological functions that are involved provide an 
illustration of how statistical and psychological analyses work to- 
gether. They also make it clear that superficial observations of dif- 
ferences between test items can be misleading as to their essential 
psychological processes. For example, the test items at early age 
levels requiring identification of objects by name or use might be re- 
garded simply as tests of information or of specific rote learning, 
whereas they actually have much in common with items that test 
“comprehension,” and which are more obviously tests of reasoning. 
Or, at a somewhat later age level, ability to define certain words 
(vocabulary test) might be regarded simply as the result of specific 
learning and verbal facility, whereas actually it has much in common 
with perception of pictorial (as well as verbal) similarities and dif- 
ferences. 

It is useful to classify test items as “information,” “word knowl- 
edge,” “perception of forms,” “reasoning,” etc.; but the point is that 
such classification does not necessarily signify that each of the subtest 
classifications measures a distinct group-factor or a special factor. 16 


16 McNemar’s findings are in close agreement with those of an independent 
study made in Great Britain. Cyril Burt reports that a common factor accounts 
for 42 percent of Stanford-Binet test variance, and that two subsidiary factors 
account, respectively, for 12 percent and 16 percent of test-score differences. 
The close correspondence obtained in the United States and in England gives 
additional weight to and confidence in the Stanford-Binet Scale as an instrument 
for measuring intelligence, particularly of children and adolescents. See Cyril 
Burt and Enid John, “A Factorial Analysis of Terman Binet Tests,” Part I, 
British Journal of Educational Psychology, Vol. 12, 1942, pp. 117-127; 
Part II, Vol. 12, 1942, pp. 156-161. 

For a dissenting analysis and interpretation, see L. V. Jones, “A Factor 
Analysis of the Stanford-Binet at Four Age Levels,” Psychometrika, Vol. 14, 
1949, pp. 299-331. 


TYPES OF ITEMS 


Bearing this important distinction in mind, then, we may in- 
dicate the types of items included in the Stanford-Binet Scale. 17 


Test Items                                  Functions Involved 


Years 2-5 

Form perception and manipulation            Visual perception and 
  (blocks, form boards, stringing             analysis 
  wooden beads) 
Perception of differences in size 
  and form 

Visual-motor operations                     Visual analysis plus 
                                              motor development 

Perception of relationships (in             Visual perception plus 
  pictures)                                   beginnings of concept 
                                              formation 

Rote memory (using digits and               Immediate recall 
  sentences) 

Use of words in combination                 Language development 
Identifying objects by name or use            and comprehension 
Following directions 

Verbal comprehension and word               Reasoning with abstractions 
  knowledge                                   and concept formation 
Understanding of “opposites” 


Years 6-12 

Form perception                             Visual analysis 

Visual-motor operations                     Visual analysis plus 
                                              motor development 

Rote memory (using digits and               Immediate recall 
  sentences) 

Word knowledge (concrete and                Language development 
  abstract) 

Verbal comprehension                        Reasoning with abstractions 
                                              and concept formation 

Number concepts                             Number concept formation 
Arithmetical reasoning                        and reasoning with 
                                              abstractions 


Year 13 to Superior Adult III 

Visual analysis and imagery                 Visual perception and 
Perception of visual relationships            analysis plus reasoning 
Visual-motor operations                       with non-verbal materials 

Rote memory (using digits,                  Immediate recall 
  words, and sentences) 

Word knowledge                              Language development 

Synthesis of verbal materials               Reasoning with abstractions 
Problem solving, using verbal 
  materials 

Verbal analysis                             Concept formation plus 
Arithmetical problems                         reasoning with 
Analysis and comprehension of                 abstractions 
  symbols 


17 Since the Bellevue Scale utilizes many of the same types of items as the 
Stanford-Binet, additional factors affecting test performance on these items will 
be presented in connection with the Bellevue Scale in the next chapter. 


THE SHORT SCALE 


It is possible to administer an abbreviated form of the scale, 
the constituent items having been specified by the authors. A short 
scale, presumably, is used when the examiner does not need as ac- 
curate an index of measurement as it is possible to obtain and when 
the necessary time is not available for a full-length examination. The 
use of an abbreviated form, however, should be discouraged, for when 
it is at all desirable to administer a test of mental ability, it would 
be very unwise indeed not to require the greatest possible accuracy. 


18 For some results obtained with the short scale, see, for example, 
G. Spache, “Methods of Predicting Results of Full-Scale Stanford-Binet,” 
American Journal of Orthopsychiatry, Vol. 14, 1944, pp. 480-482; G. Spache, 
“The Abbreviated Stanford-Binet Scale in a Superior Population,” Journal of 
Educational Psychology, Vol. 35, 1944, pp. 314-318. 


EVALUATIONS AND CRITICISMS 


Every scale, quite properly, is subjected to evaluation and 
criticism on theoretical, practical, and experimental grounds. 
The 1937 Stanford-Binet is no exception. On the whole, educators 
and clinicians who have had experience with both the old and the new 
Stanford-Binet scales are in essential agreement that the new one, in 
spite of some inadequacies, is a more useful instrument of its kind than 
the older one. The following, however, are some of the questions and 
criticisms that have been raised. 


Is the age-scale (Binet) type of test preferable to the point-scale type 
of test? The standardization of an age scale is much more laborious 
and rigid than standardization is in the case of a point scale. The re- 
sult is that even when experience and experimentation reveal certain 
defects and inadequacies in an age scale, the difficulties in making the 
indicated changes are so great as to be a deterrent to early revision. 

As instances in point, some psychologists report difficulties in in- 
terpreting responses to certain items and in scoring them. They also 
question the age placement of some items. The task of correcting 
these defects, if the criticisms are warranted, would be great. 19 Note, 
for example, that the authors of the 1937 Stanford-Binet made six 
revisions of Form L before they were sufficiently satisfied with the age 
placement and grouping of test items, which would yield correct 
mental ages, intelligence quotients, and IQ distributions. 

In the case of a point scale, on the other hand, it is a much simpler 
process to revise age norms. All other things being equal, of course, 
the simpler and easier methods should be employed to achieve a de- 
sired goal in psychological testing. But simplicity and ease alone 
should not be the decisive considerations. The crucial question is 
whether the age-scale type of test or the point-scale type provides a 
superior means of obtaining a measure of an individual's mental 
ability. Thus far, although the views of competent professional per- 
sons are not unanimous, it appears that the age scale is preferable for 
use with children and young adolescents. But when older adolescents 
and adults are to be examined, it has often been found that point 
scales (e.g., the Wechsler-Bellevue) are preferred. 20 Also, there is at 
present a trend toward increased use of point scales because of their 
apparent value in clinical diagnosis by means of “scatter analysis” of 
scores on the subtests. (See Chapter 15.) 


19 For example, M. Krugman, “Some Impressions of the Revised Stanford- 
Binet Scale,” Journal of Educational Psychology, Vol. 30, 1939, pp. 594-603; 
H. E. Garrett, “The Standardization of the Terman-Merrill Revision of the 
Stanford-Binet Scale,” Psychological Bulletin, Vol. 40, 1943, pp. 194-201. 


Do the differences in variability (standard deviation) at the different 
age levels seriously impair the usefulness of the Revised Stanford- 
Binet Scale? A wholly satisfactory test of intelligence should have 
equal or very nearly equal variability at all age levels. 21 Data already 


TABLE 26 
IQ’s Adjusted for Variability Differences 
at Several Age Levels 22 

                      Adjusted IQ’s, by Age Range 
Obtained     2-4 to     4-10 to     11-6 to     14-6 to 
IQ’s          3-3         6-6        12-5        15-5 
140           134         149        135         136 
130           125         137        126         127 
120           117         125        117         118 
110           108         112        109         109 
100           100         100        100         100 
90             92          88         91          91 
80             83          75         83          82 
70             75          63         74          73 
60             66          51         65          64 
50             58          39         56          56 


cited (Table 23) show that the standard deviations of IQ’s as found 
by Terman and Merrill were not equal at all age levels, though most 
fell close to 15-17 points. However, data have been provided to adjust 
for variability differences in IQ at age levels where such adjustments 
are necessary. Table 26 provides some examples of these adjustments. 

Table 26 is read thus: If an individual’s age is between the limits of 
2-4 and 3-3 and his obtained IQ is 140, his adjusted rating would 
be 134. 
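The general logic of such an adjustment can be sketched as a linear rescaling of deviations from the mean. This is an assumption offered for illustration, not McNemar's published procedure, and it is not guaranteed to reproduce every entry of Table 26; the function name and the age-group standard deviation of 13 points used below are hypothetical.

```python
def adjust_iq(obtained_iq: float, age_sd: float,
              reference_sd: float = 16.0, mean: float = 100.0) -> float:
    """Rescale an IQ so that its distance from the mean is expressed
    in units of the reference standard deviation.

    Assumption: a simple linear rescaling about the mean; the source
    does not state the exact method used to build Table 26.
    """
    return mean + (obtained_iq - mean) * reference_sd / age_sd


# Hypothetical age group whose IQ's have a standard deviation of 13,
# i.e. less spread than the reference value of 16.
for iq in (140, 120, 100, 80, 60):
    print(iq, "->", round(adjust_iq(iq, age_sd=13.0)))
```

Under this sketch an IQ of 100 is unchanged, and the size of the adjustment grows with the distance of the obtained IQ from 100, which is the pattern Table 26 displays.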


Two aspects of this table should be noted in particular. First, the 
adjustments are small for IQ’s near the average (100) and larger 
as the obtained IQ’s deviate more from the average. Second, in no 
instance would the adjusted IQ seriously displace an individual from 
one general level of ability to another, even though in a few instances 
the changes are appreciable, notably from an IQ of 50 to one of 39 
in the 4-10 to 6-6 range. 

The answer to our question is, then, that differences in obtained IQ 
variability on this scale do not seriously impair the usefulness of the 
Revised Stanford-Binet Scale; and they need not impair its usefulness 
if the user of the scale is aware of the differences in variability at the 
several age levels and makes the necessary adjustments of IQ values. 
It is to be expected, of course, that later revisions of the Stanford- 
Binet will minimize the adjustments. 23 


20 For example: F. Halpern, “A Comparison of the Revised Stanford L and 
the Bellevue Adult Intelligence Test as Clinical Instruments,” Psychiatric Quar- 
terly Supplement, Vol. 16, 1942, pp. 206-211. 

21 This requisite is based upon the principle that the distribution of general 
ability is equal at all the age levels covered by the test, rather than being either 
irregular from year to year or changing systematically at successive ages. 

22 From Q. McNemar, op. cit., pp. 173-174. (By permission.) 


Are the Stanford-Binet test items a variety of disconnected tests? This 
scale has been criticized at times as being only that. This criticism, 
however, is based upon a failure to take into account the theory of 
intelligence and the method of measurement upon which the scale is 
based: namely, the sampling of general ability by means of a wide 
variety of types of items in order to obtain an adequate estimate of 
the processes involved. Furthermore, the factorial analyses already 
discussed in this chapter show that the test items measure primarily a 
general factor that is common to all age levels of the scale. 24 On cur- 
sory inspection, the items may seem to be dissociated; but psychologi- 
cal and statistical analyses demonstrate that such is not the case in 
fact. 


Is the composite, or “global,” type of scale (such as the Stanford- 
Binet) preferable to the factorially analyzed type, in which “factors” or 
“primary abilities” are separately tested and scored? Again, the final 
answer to this question will depend upon whether the factorial type 
of scale proves to be more valid, more accurate, and more useful than 
the composite type has been in clinical work and in educational and 
vocational guidance. 


23 The differences in standard deviations in IQ at different age levels 
may result from either or both of the following: (1) unequal item difficulty 
at the several age levels; (2) inadequate samplings of the standardization popu- 
lation. See McNemar, op. cit., Chapter 8; also A. L. Baldwin, “Variation in 
Stanford-Binet IQ Resulting from an Artifact of the Test,” Journal of Person- 
ality, Vol. 17, 1948, pp. 156-198; J. A. F. Roberts and M. A. Mellone, “On 
the Adjustment of Terman-Merrill IQ’s to Secure Comparability at Different 
Ages,” British Journal of Psychology, Statistical Section, Vol. 5, 1952. 

24 Even the analyses that emphasize group factors rather than a general fac- 
tor do not support the criticism that the items are merely “a variety of discon- 
nected tests.” 

It is not improbable, however, that psychologists will develop satis- 
factory tests of intelligence that will yield an index of general ability 
(e.g., mental age and intelligence quotient) and which, at the same 
time, can be analyzed and scored for particular aspects of intelligence 
or particular abilities. To do this, instead of grouping items by age 
level, it will be necessary to devise a number of parts, or subtests (e.g., 
verbal reasoning, numerical ability, etc.), each of which would contain 
a series of items, all items within each subtest measuring the same 
mental processes, but scaled in difficulty. Each subtest would yield a 
separate index of test performance; and at the same time the individu- 
al’s average performance would represent his general level. This is 
what the Wechsler scales (discussed in the following chapter) aim at. 

At the present time, the Binet type of test provides only the general 
indexes of MA and IQ. But—and this matter will be dealt with more 
fully later—the qualified examiner is able to make a significant quali- 
tative analysis of a subject’s performance on the scale. 


Is the Stanford-Binet Scale too heavily weighted with verbal materials? 
A criticism, heard with some frequency against both the old and the 
new revisions, is that these scales place a premium upon “verbal intel- 
ligence” and that subjects having language handicaps are penalized 
and incorrectly rated. In reply to this criticism, Terman and others 
hold that the most essential and most significant aspect of higher 
thought processes is the ability to do conceptual and abstract think-
ing; that is, to operate with language, number, and other symbols. It
is maintained, also, that the vocabulary test, when used with children 
from homes where English is the primary language, has higher value 
than any other part of the scale. 5 

It must be emphasized, however, that this is not so in the case of 
a child who, even though he comes from such a home, has reading 
or language difficulties due not to lack of capacity but to visual or
auditory defects or anomalies.

In actual clinical practice, the examiner should always supplement
an essentially verbal test of intelligence with one of the nonverbal type
if he has any reason to suspect that the former penalizes the subject.
There will be occasions also when it will be desirable to obtain a


Evaluations and Criticisms 151 


rating for an individual on both types, even where no language handi- 
cap is indicated, for the purpose of comparing two or more aspects of 
a subject’s ability. While the correlation between performance on
verbal and on some nonverbal tests of mental ability is high, coeffi- 
cients of correlation reflect group trends and relationships; and unless 
the correlation coefficient is perfect (plus or minus 1.00), there are 
always individual exceptions from the generalization that can be made 
on the basis of the coefficient; hence the need, at times, for the study 
of the several aspects of ability in an individual case. 
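
The point may be made concrete with a small computation. The Python sketch below uses seven hypothetical pairs of scores (invented for illustration, not drawn from any study) and the ordinary product-moment formula, here given the assumed name `pearson_r`. It shows that a high coefficient for the group can coexist with a marked discrepancy in an individual case.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of seven subjects on a verbal and a nonverbal test.
verbal = [85, 92, 100, 104, 110, 118, 125]
nonverbal = [88, 90, 97, 120, 108, 115, 122]

r = pearson_r(verbal, nonverbal)

# Absolute difference between the two scores for each subject.
gaps = [abs(v - n) for v, n in zip(verbal, nonverbal)]
largest = max(range(len(gaps)), key=gaps.__getitem__)

print(f"r = {r:.2f}")
print(f"largest individual discrepancy: subject {largest}, {gaps[largest]} points")
```

In these invented data the coefficient is high, yet one subject differs by sixteen points between the two tests; it is for such a case that both types of test would be wanted.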


Is the Stanford-Binet Scale a test of school learning? As stated in an
earlier chapter, any scale must test mental ability through its mani-
festations, through activity of some kind. And as Binet originally
pointed out, the tests must be adapted to the environment of the sub-
jects to be rated. It is reasonable and sound, therefore, that a scale
designed to test mental ability primarily of American children and
adolescents should utilize the effects of common schooling experi-
ences, as well as effects of some other common experiences. To say
that the Stanford-Binet or any other scale is only a measure of school
learning is unwarranted. Differences in quality and extent of oppor-
tunity to learn, in school and out, will have an effect upon intelligence-
test scores; but such differences in opportunity do not in themselves
account for all individual differences in ability that are found. Good
schooling and other good environmental conditions nurture an indi-
vidual's mental capacities and provide optimal conditions for his men-
tal development. Thus we can say that the Stanford-Binet and other
scales provide ratings of intelligence, within limits of error, under
existing school conditions, general environmental conditions, and clini-
cal conditions. Obviously, therefore, the examiner must know the gen-
eral developmental background of the individual he is testing, if his
interpretation of test results is to have real validity.

Several studies of satisfactory and unsatisfactory responses on the
Stanford-Binet have found that bright and superior children answer
correctly more of the “intellectual” items than do normal or dull chil-
dren. These items include the verbal and numerical, utilizing symbols

25 A. E. Baldwin, “The Relative Difficulty of Stanford-Binet Items and
Their Relation to IQ,” Journal of Personality, Vol. 16, 1948, pp. 417-430;
A. Magaret and C. W. Thompson, “Differential Test Responses of Normal,
Superior, and Mentally Defective Subjects,” Journal of Abnormal and Social
Psychology, Vol. 45, 1950, pp. 163-167.




and abstractions. This finding does not mean that bright and superior 
children have attained their ratings on mental tests only as a result of 
schooling. It is to be expected that individuals who are potentially 
above average in mental ability will be superior in dealing with situ- 
ations and problems employing language and number; for the greater 
the capacity of the individual is for mental development, the greater 
will be his ability to deal with symbols and to handle situations and 
problems at the level of abstraction. The converse has also long been 
known, namely, that one of the principal deficiencies of the mentally 
retarded and mentally defective is their inability to deal with materials 
and concepts at the levels of abstraction. It will be recalled that one 
definition states that intelligence is the ability to deal with abstractions. 
It will be recalled, also, that the educing of relations and correlates 
extends upward to the use of symbols (language and number).


Are some of the items in the Stanford-Binet Scale obsolete? Since 
any test utilizes materials from the environment in which it is to be 
used, and since environments normally undergo change, it is to be ex- 
pected that some items in any given test will in time become culturally 


obsolete. In the Revised Stanford-Binet there are a few such items. 
For example: 


Identifying a toy steam locomotive by name


Identifying objects (pictures) by use, such as an old-fashioned
kitchen stove


Response to a picture (Messenger Boy, year 12)


There are not many such items in this scale. In the case of some of 
the non-obsolete items, however, it becomes necessary, with time, to 
revise to some extent the responses that are acceptable for credit. For
example, the following item is one such: “What’s the thing for you to 
do when you are on your way to school and see that you are in danger 
of being late?” (Year 7) When scoring an unusual response to an 
item like this, the qualified examiner is warranted in exercising his 
judgment as to its correctness; in fact, he has to do that. And his 
decision will be based upon his familiarity with the psychological
processes being tested by the particular item. 


Does the Stanford-Binet Scale test different abilities at different age 
levels? The answer to this question is to be found largely in the pre- 
ceding discussion of “Analysis of Functions Tested” in this chapter. 




There it was stated that the two major analyses (by McNemar and 
Burt) found that the scale measures principally a general factor that 
is common to all age levels, and in addition there appear to be group 
factors at 2, 2½, 6, 18, and possibly at 7 and 11. It was pointed out
that the items having low loadings of the general factor were, with 
minor exceptions, tests that required the use of visual perception, 
visual imagery, and rote memory. Emphasis was placed, also, upon 
the necessity of distinguishing between item-type classifications of 
tests (vocabulary, arithmetical reasoning, etc.) and basic psychologi- 
cal processes involved in each. 


Does the scale measure originality and creative abilities? The answer 
is that it does not measure these abilities, as such, to an important de- 
gree. This aspect of intelligence was discussed in Chapter 3, “Defi- 
nitions and Analyses of Intelligence.” There it was pointed out that 
the requirement of objectivity in standardizing and scoring tests prac-
tically excludes tests of creativeness and originality. But, as also stated,
while not all individuals who achieve high scores on intelligence tests
evidence originality and creative ability, those who are capable of
originality and creative mental activity do generally obtain high test
scores. It is possible to say, therefore, that persons having creativeness
and originality will be found generally in the group who attain su-
perior ratings on the intelligence scales. Furthermore, although origi-
nality and creativeness cannot be included in the prescribed objective
scoring, a qualified examiner will note responses indicating these traits
and will include them in his interpretation and evaluation of the ex-
aminee’s performance.


Is the 1937 Stanford-Binet Scale adequate at the adult level? The 
standardization group of this scale included individuals of 18 years, 
but it did not include an adult population. Therefore, the test items at 
the several adult levels rest upon theoretical considerations already 
mentioned, rather than upon actual samplings of adult performance.
One result, due perhaps to the methods used in standardizing the scale
at superior adult levels, has been its frequently observed inadequacy
with college students who, as a group, would be ranked above average. 
The inadequacy of the scale is especially marked when administered 
to very superior students, for it is not difficult enough at the higher 


levels of adulthood. 




Some psychologists have also questioned whether some of the items 
and materials employed at the adult levels are of sufficient interest to 
adults. These critics believe there are a number of items that are un- 
suited to the maturity level of adulthood. Finally, there is the question 
of the soundness of using the mental-age index for adults above the 
mental level of the average adult. The reason for this question has already
been presented. Many psychologists, therefore, prefer to discard the 
mental-age index at adult levels, in favor of percentile ranks, standard 
scores, and the like. 


Is the 1937 Stanford-Binet Scale clinically useful? Judging by the 
extent to which it is employed, the answer must be a strong affirma- 
tive. Examiners find the scale useful not only for deriving a mental 
age and an intelligence quotient, but also as the “framework” within 
which a psychological interview is held. If the scale is to serve this 
purpose, the examiner must have considerable clinical experience and
skill to interpret and evaluate a subject’s responses and behavior.

In spite of its wide clinical use, however, there is some difference in
evaluations of the instrument among clinical examiners. Kent takes
the position, for example, that the scale is too complex and too rigid.26
She would have a graded series of items for each of the several types
of tests. The clinician would then use a “battery,” or selection, of
these, “custom-made” for each individual, to meet each special clini-
cal problem. Separate norms and relative ratings for each graded se-
ries would have to be provided, from which median performance
would be determined.

This criticism and proposal are countered by Vernon, for instance,
who states that there is great value to a clinician in having available
objective and standardized measuring devices, such as the Stanford-
Binet. This scale, it is maintained, provides the clinician with an ex-
cellent psychometric device which, at the same time, permits sufficient
flexibility to meet the situation as may be necessary in a particular
case.27 Many experienced psychologists agree with Vernon’s position.

26 G. H. Kent, “Suggestions for the Next Revision of the Stanford-Binet
Scale,” Psychological Record, Vol. 1, 1937, pp. 409-433; see also V. V. Flem-
ing, “A Study of the Subtests in the Revised Stanford-Binet Scales, Forms L
and M,” Journal of Genetic Psychology, Vol. 46, 1944, pp. 3-36.

27 P. E. Vernon, “The Stanford-Binet Test as a Psychometric Method,” Char-
acter and Personality, Vol. 6, 1937, pp. 99-113. See also G. W. Parkyn, “The
Clinical Significance of IQ's on the Revised Stanford-Binet Scale,” Journal of
Educational Psychology, Vol. 36, 1945, pp. 114-118. This article is concerned
mainly with individuals having IQ's under 80.




As new scales are devised—built upon different conceptions and 
theories—they will have to be subjected to both experimental investi- 
gation and practical use before their value can be compared with that 
of the Stanford-Binet. A valid judgment cannot be reached on a priori 
grounds. In the meantime, the great value of the Stanford-Binet Scale 
having been demonstrated, psychologists will continue to use it. They 
will bring to bear their insights on the interpretation of the behavior 
of the persons being examined and on the interpretation of the test 


results obtained.28


28 The reader will find a very useful analytical table giving the content of
the several Binet scales and of the several American revisions in G. D. Stod- 
dard, The Meaning of Intelligence. New York: Macmillan, 1943, pp. 483-497. 

In a personal communication [March, 1954] Dr. Maude A. Merrill wrote: 
“Analysis of my preliminary sample of some 800 cases [tested within the last 
five years] between five and fourteen yields results that are very heartening. 
The indications are that the scale is doing in 1954 pretty much what it did in 
1937, that there are some differences in difficulty in certain of the subtests and 
specifically certain of the items that can be changed in accordance with the 
indications by adding pretested substitute or modified items without doing vio- 
lence to the original structure of the scale. The curves showing increase in 
percent passing with age are remarkably regular and follow very closely the 
original percentages except on a rather surprisingly small number of items. 
One finding . . . was that the modern elementary-school child is handicapped
on tests that assume the acquisition of reading skills. In 1937 we could assume 
that, on the average, by the age of ten, children had acquired reading tech- 
niques that would enable them to make use of such skills in the problem- 
solving tasks presented by certain of the subtests. Reading and Memory at the 
ten-year level, Minkus Completion at the twelve-year level and Dissected Sen-
tences at thirteen stand out . . . as presenting much more difficulty than in
1937. The obvious inference is that this finding reflects differences in educa-
tional practices. These will need to be verified on another sample.” (Repro-
duced with permission of Dr. Merrill.)


7.


THE WECHSLER SCALES 


THE Binet scale and its several revisions are largely verbal in content, 
although some nonverbal items are included, especially at the early 
age levels. There are, however, other scales which are wholly, or in 
large part, of the performance or nonverbal type. This type of test is 
one in which the use of language is eliminated from test content and 
response, although directions are generally given orally. In a few in- 
stances the directions, too, are given without the use of language, by 
employing pantomime instead. 

The test materials of a nonverbal scale consist of concrete objects 
such as form boards, cubes (to be arranged in specified ways), mazes, 
geometric figures, pictures (cut up, to be correctly assembled), and 
others that will be described in later sections. The individual’s re- 
sponses take the form of manipulations, visual perceptions, and inter- 
pretations which are implied by what he does rather than by anything 
he says. 

Performance tests were first devised as a supplement to or substi- 
tute for the Stanford-Binet scale in order to examine deaf, illiterate, 
or non-English speaking subjects. Since their introduction, the use of 
nonverbal tests has been extended; for they are now utilized with chil- 
dren who have or are suspected of having reading difficulties, with 
those who have attended school irregularly and might thus have been 
handicapped in developing verbal ability, and with persons who might 
have been handicapped by markedly inferior environmental condi- 
tions. Nonverbal tests are used, also, by examiners who, for any other 
reason, believe that such a scale will yield a more complete picture 


The Wechsler-Bellevue Intelligence Test 157 


of the individual whose capacities are being analyzed and evaluated. 

The Wechsler scales, to be described in the following pages, com- 
bine verbal and nonverbal materials within a single instrument in an 
effort to obtain the advantages, comparisons, and contrasts yielded by 


both types of materials. 


DESCRIPTION OF THE WECHSLER-BELLEVUE 
INTELLIGENCE TEST * 

This scale, published in 1939, is intended to test the intelli- 
gence of persons from the age of 10 through the age of 60 years, al- 
though norms are available beginning at 7.5 years. This or a similar 
beginning level is necessary, of course, if adults and adolescents of 
lower-than-average mental levels are to be tested by means of the 
scale.

The Bellevue scale will be presented in some detail so that the
reader may understand the rationale of its construction, its values, and
its limitations.

Content of the Scale. The scale is in part verbal and in part perform-
ance, enabling the examiner to obtain three scores and three quo-
tients: a full-score and its intelligence quotient, a verbal score and its
IQ, a performance score and its IQ.

The scale has been constructed in this form on the principle that


intelligence involves not only ability to deal with symbols, abstrac- 
tions, and conceptual thinking, but that it also involves ability to deal 
with situations and problems in which concrete objects, rather than 
words and numbers, are utilized. The scale also is put to the pragmatic 
test of whether a given combination of items (in this instance verbal 
and nonverbal) serves the purpose of individual mental examination 
and analysis of capacities better than other combinations. The types 
of tests included in this scale are not unique. They were selected from 
available sources after a study had been made of a variety of stand- 
ardized tests then in use. The objective was the construction of an 


* Cf. D. Wechsler, The Measurement of Adult Intelligence, Baltimore: Wil-
liams and Wilkins, 1944, third edition. A second form of this scale was pub-
lished in 1946. Form II is identical with the first form in respect to underlying
principles and types of test materials. The specific content, of course, is differ-
ent. See The Wechsler-Bellevue Intelligence Scale: Form II, New York: The
Psychological Corporation, 1946.


158 The Wechsler Scales 


effective scale for adolescents and adults, based upon already known 
and proven psychological materials and procedures.

The Bellevue scale consists of eleven parts, or subtests, six being
verbal in content and five nonverbal.2 The principal difference be-
tween the Stanford-Binet and the Bellevue scale, in respect to arrange-
ment of items, is this: in the former, items of various types, testing a
variety of functions, are grouped together at each age level; in the
latter, all items of one type are grouped together, constituting a sub-
test of the whole. In the case of the latter, the effort is made to arrange
the individual items within each subtest in a sequence of increasing
difficulty. A description of the subtests follows. The first six constitute
the “verbal scale”; the remaining five constitute the “performance
scale.”


FIG. 7.1. A 12-year-old girl is shown here taking the block-design
test of the Bellevue scale. The model of the design to be copied is
on the card before her. The examiner is timing her with the stop-
watch in his left hand. (Acme Photo.)

2 One of the verbal subtests, vocabulary, was originally added as an alternate 
or supplement. However, it is now used quite regularly with the other subtests. 




(1) Information. This consists of twenty-five items of informa- 
tion, covering a wide range. (For example, “How many weeks are 
there in a year?”) The assumptions are that the questions cover a 
wide enough range of materials to provide an adequate sampling of 
information acquired by a person who has had the usual opportunities 
of our society; that the range of an individual’s information is an indi- 
cation of his intellectual capacity; and that the more intelligent have 
broader interests, more curiosity, and seek more mental stimulation. 
This view can be valid, however, only if the subjects being tested have 
had the usual opportunities for experience and learning and if the test 
items are a valid sampling of the opportunities to acquire information.
It is necessary to keep this caution in mind in evaluating a test of in- 
formation. Performance on information tests is susceptible, also, to 
variations in individual motivation; i.e., some bright, self-absorbed
persons or those unreceptive, for emotional reasons, to the offerings 
of their environments have unduly limited funds of information. Oth- 
ers, for motivational reasons quite removed from general intelligence, 
exhibit an exaggerated and misleading fund of information. For these 
very reasons, however, a test of information is a useful one for clinical 
purposes: that is, for purposes of diagnosing personality traits.

(2) General comprehension. This part of the scale consists of ten 
problem situations (plus two alternates), in which the subject must 
comprehend what is involved in the situations and provide answers to 
problems presented. (For example: “Why should people pay taxes?”)
Success on this subtest, it is held, depends upon possession of practical 
information plus ability to evaluate and utilize past experience. It 
appears, also, that ability to verbalize is a factor contributing to suc- 
cess. Tests of general comprehension are now very commonly used in 
intelligence scales; and, it will be recalled, they were included in the 
Binet scales. They have been found valuable clinically in revealing the 
thought processes, background, feelings, and emotions of the subject. 

(3) Arithmetical reasoning. This part of the Bellevue scale is 
designed to test “mental alertness.” The problems, the author of the 
test states, do not require “knowledge” (presumably arithmetical 
skills) beyond that of the seventh-grade level. Problems in arithmeti-
cal reasoning are widely used in tests of intelligence, since they have 
been found to have significant correlation with total scores of scales 
and to have high predictive value in respect to future evidences of 


mental ability. 




(4) Memory span for digits, forward and backward. The subject 
is required to repeat series of digits heard once. The series vary in 
length from three to eight (backward) and nine (forward). This is a 
test of immediate recall, or immediate memory span. Psychological 
studies—both experimental and clinical—have consistently shown that 
tests of memory span, or immediate recall, of digits have a low corre- 
lation with other, more valid tests of intelligence. Yet, memory span 
for digits continues to be used because it is helpful in detecting the 
mentally defective, whose span is often very short (generally less than 
five digits forward and less than three backwards), and because very 
poor span is useful in making certain clinical diagnoses of organic
defects. Poor memory span for digits, especially backwards, is also 
found at times in cases of persons who are unable to apply the atten- 
tion necessary in solving more difficult mental tasks. 

(5) Similarities. This part of the scale consists of twelve sets of 
paired words; for each pair the subject is required to state the simi- 
larity or similarities that exist. (For example, orange-banana.) The
author of the scale regards the similarities test as one of the most satis- 
factory, for it appears to sample very well the “general factor” (Spear- 
man’s g), or what is commonly called general intelligence. 

(6) Vocabulary. This test consists of forty-two words selected 
from an original list of one hundred which were chosen from a dic- 
tionary according to a sampling formula * and then experimentally 
arranged in order of difficulty. The reader is already familiar with the 
view held by many psychologists that a vocabulary test—where there 
have been no unusual developmental factors—is one of the most valu- 
able kinds of materials in deriving an index of a person’s general in-
tellectual ability. Thus, although the vocabulary list was originally
recommended as an alternate test in the Bellevue scale, experience
demonstrated its value, so that it is now suggested that this part be
included regularly when the full scale is to be administered. Also, like
Binet and many other psychologists, users of the Bellevue scale ob- 
served that qualitative differences in word definitions, given by various 
subjects, have clinical value in helping to reveal the nature of an in- 

*“The words . . . were taken from one of the Funk and Wagnall’s Stand-
ard (School) Dictionaries. The list was arrived at by choosing 100 words at
random in the following manner: Beginning with an odd page, we selected
every top word but one (omitting, however, obsolete, technical, or esoteric
words) in the left-hand column of every fifth page and continued the process
until we had 100 words.” (Wechsler, op. cit., p. 99.)




dividual’s thought processes (their depth, extent of analysis, nuances 
of meanings, queerness of definitions, cultural background) and, in 
some instances, feelings, emotions, and values. 


(Numbers 7 through 11 are the performance tests.)


(7) Picture arrangement. In this subtest of the Bellevue scale, 
there are seven series of pictures. Each series is presented to the sub- 
ject in a disarranged order; but when the pictures in each series are 
placed in the correct sequence, they tell a story. This type of test, it is 
held, measures a person’s ability to comprehend and evaluate a total 
situation without the use of language. From the author’s data, how- 
ever, it appears to involve the “general factor” only to a moderate 
degree, but sufficiently to make a contribution to the total sampling. 


FIG. 7.2. The Disassembled Hand (Object-Assembly Test). From the
Wechsler-Bellevue scale. (By permission.)


(8) Picture completion. In this part, there are fifteen cards, each 
of which shows a picture that is incomplete in some detail (for exam- 
ple, a picture of a face with the nose missing). The testee is required 
to note and name the missing part.* In some pictures the task is quite 
simple for the ordinary person; but in others the deficiencies of the 
pictures are somewhat more subtle. It has been found that this ma- 
terial is particularly valuable in testing lower-level intelligence, as well 


* Generally this type of test is called “mutilated pictures.” It was used by
Binet in his scales and is now widely used in group tests, as well as in Binet
revisions. The term Picture Completion is misleading, since the testee does not
actually complete the picture.




as having moderate discriminative value at the intermediate levels. At 
the higher levels, however, this test is inadequate because it is not of 
sufficient difficulty. On the whole, it is said that this part of the scale 
“. . . measures the individual's basic perceptual and conceptual abili- 
ties in so far as these are involved in the visual recognition and identi- 
fication of familiar objects and forms. . . . In a broad way, the test 
measures the ability of the individual to differentiate essential from 
nonessential details.” 5

(9) Block design. This test utilizes sixteen identical cubes, some 
or all of which are used to make nine given designs (two of which are 
for demonstration). One side of each cube is colored blue, one red, 
one white, one yellow, one red and white, one blue and white (the last 
two being divided diagonally). For each of the four easiest designs, 
only four blocks are used; for the next two, nine blocks are used; for 
the most difficult design, sixteen are used. The author of the scale be-
lieves that this test involves ability to analyze and synthesize. He re-
ports that scores on block design correlate well with total scores of
the test, as they do also with the separate scores on the comprehen- 
sion, information, and vocabulary tests of the Bellevue scale. This 
would indicate that the block-design test is valuable as a measure of 
the general factor. 

(10) Object assembly. This test includes three “figure form- 
boards”; that is, three familiar objects made of wood, each one cut 
into several parts which have to be assembled to make the whole. 
The objects are a manikin, a feature profile (side view of a human 
head), and a hand. The inclusion of this test is justified by the author 
on the ground that it is desirable to have a device which requires the 
subject to perceive and reconstruct the parts of familiar objects into 
their wholes. Evidence on the object assembly test indicates that it
correlates poorly with other parts of the scale and has only very
limited value in differentiating between individuals. It does, however,
have clinical value, of a qualitative kind, in that it contributes to the
examiner’s understanding of the subject's “modes of perception, the
degree to which one relies on trial and error methods, and the manner
in which one reacts to mistakes.” 6 This subtest is useful, also, in
diagnosing disturbance of visual perception.

5 Wechsler, op. cit., pp. 90-91. 
6 Wechsler, op. cit., p. 98. 


(11) Digit-symbol test. The subject is shown nine divided rec-
tangles; in the upper half of each rectangle is a digit; in the lower half
there is a symbol. For example, the following:

[Figure: sample digit-symbol key; not recoverable from the source.]

The key is followed by 75 rectangles (of which eight are practice
samples) in which only the numerals are given. In each instance, the
subject is required to add the appropriate symbol. This test, also
known as a substitution test, is regarded as requiring the association
of symbols and involving speed and accuracy of performance. It in-
volves, also, visual memory. The purely motor factor, it has been
found, is relatively unimportant, except in the case of illiterate persons
who are not accustomed to using pencil and paper.


FUNCTIONS INVOLVED IN THE SUBTESTS


The functions involved in each of the eleven subtests may be
analyzed as shown below. They indicate the processes operative in
most effective performances on each of the tests. This analysis
should be distinguished from a factorial analysis, which is a
statistical procedure whereby an attempt is made to consolidate
and reduce the number of nonstatistically analyzed functions on the
basis of communality, and perhaps to re-name them.


Subtest              Functions                                     Influencing Factors

Information          Long range retention                          Cultural environment
                     Association and organization of experience    Interests

Comprehension        Reasoning with abstractions *                 Cultural opportunities
                     Organization of knowledge                     Response to reality situations

Arithmetic           Reasoning with abstractions                   Attention span
                     Concept formation                             Opportunity to acquire the funda-
                     Retention (of arithmetical processes)           mental arithmetical processes

Similarities         Analysis of relationships                     A minimum of cultural opportunities
                     Verbal concept formation

Vocabulary           Language development                          Cultural opportunities

Digit Span           Immediate recall                              Attention span
                     Auditory imagery
                     Visual imagery at times

Picture              Visual perception of relationships            A minimum of cultural opportunity
Arrangement            (visual insight)                            Visual acuity at times
                     Synthesis of nonverbal material

Picture              Visual perception: analysis                   Environmental experience
Completion           Visual imagery                                Visual acuity at times

Object Assembly      Visual perception: synthesis                  Rate of motor activity
                     Visual-motor integration                      Precision of motor activity

Block Design         Perception of form                            Rate of motor activity
                     Visual perception: analysis                   Minimum of color vision
                     Visual-motor integration

Digit Symbol         Immediate rote recall                         Rate of motor activity
                     Visual-motor integration
                     Visual imagery


* “Reasoning with abstractions” generally involves the processes of both
analysis and synthesis, with the use of symbols (language and number). The
testee must first analyze the relationships existing among the members, or parts,
of the whole problem; then he must reorganize and interpret and, at times,
create new wholes in order to reach the desired solution.


While the ability to verbalize and to make abstractions is not
necessarily involved in the five nonverbal subtests, nevertheless it has
often been observed by examiners that this ability does facilitate and
expedite one’s performance. For example, on the Block-Design test, it
is possible to analyze and formulate the color and form relationships
of each design before beginning to reproduce it. In the Picture
Arrangement test, some subjects will attempt to discern and state the
story told by the group of pictures before starting to place them in
the correct sequence. It is important to recognize this psychological
fact in evaluating and analyzing performance on tests that are
primarily nonverbal: namely, that even with such types of tests, ability
to verbalize and abstract may be and at times is one of the
psychological functions involved.


NEED FOR AN ADULT SCALE 


It will be recalled that Binet’s own scales were not suited to
use with adults; nor was the Stanford Revision of 1916. And while the
1937 Stanford-Binet is better standardized at the upper levels,
including three “superior adult” levels, there is reason to question its
adequacy in testing superior adults. There is also, as has already been
explained, the difficulty and inconsistency of using the mental-age
index with superior adults; that is, those having intelligence quotients
above 100, who would therefore have to have a mental age rating
above that of the average adult.

Intelligence testing of adults was begun on a large scale in 1917
with the establishment of a psychological division in the United States
Army, in World War I. At that time, the Army Alpha (verbal) and
Army Beta (nonverbal) tests were assembled, and with them about
one and three-quarters million men were tested. This experience in
large-scale testing provided the impetus for the development, after
the war, of a number of other group tests for adults. But these tests
did not prove to be adequate, with the exception of several which were
designed for use with selected and limited groups of our population,
such as candidates for admission to colleges. (These will be presented
in a later chapter.) The Bellevue scale was, therefore, developed and
has been offered as an individual test, standardized for ages ten to
sixty.⁸

8 Although the age distribution of subjects used in standardizing the
scale was from seven to seventy, the author gives the age range for use
as ten to sixty.


STANDARDIZATION


The first step in standardizing the Bellevue scale was to adopt
a view of intelligence which should serve as the framework, so to
speak, within which the test items would have to fit. The general-factor
(g) theory was adopted, which requires that there should be significant
intercorrelations between the several parts of the scale, and that
the scale in its totality should provide a valid index of the individual’s
general ability.

Having accepted the theory of a general factor, the author of the 
Bellevue scale then had to determine which types of test materials 
should be included in the scale in order best to measure that factor. 
On the basis of past experience, it was decided that both verbal and 
nonverbal materials provide the most adequate and representative 
content, rather than either one alone. Having arrived at this view, the 
author and his collaborators proceeded to select the particular types 
of subtests which had been widely used by many other psychologists 
and which experience and experimentation had proved were valuable. 
In some instances, specific items, already in use, were included within 
appropriate subtests; in other instances, it was necessary to create new 
items for each of the subtests. 


THE POPULATION SAMPLE 


The next step was the selection of the population upon whom 
the scale should be standardized. The basis used was this: the sam- 
pling of adult population should be based upon the occupational dis- 
tribution of the country’s adults, as shown in the United States Census 
of 1930. The very numerous occupational subdivisions of the census 
were combined to make ten comprehensive categories (such as agri- 
culture, manufacturing, clerical, etc.). From the records of many 
adults who had already been tested and whose scores were available, 
a standardization population was selected so that their occupational 
distribution should correspond reasonably well with that of the 1930 
census. The final adult standardization group included 1081 literate 
individuals (all white) ranging in age from 17 to 70. After this adult 
population had been selected, their educational status (educational 
level completed) was compared with that of the U. S. population at 
large, the assumption being that a similar distribution of educational 
levels of the standardization group would provide further evidence of 
its representative character. In this respect, only moderate correspond- 
ence was found, for on the whole the educational level of the stand- 
ardization group is higher than that of the general population, al- 
though the range in both groups is from college graduates to illiterates. 

Since the Bellevue scale was to be standardized for some ages below
adult levels, it was necessary also to utilize a population of children.
For the selection of a representative population of the younger ages 
(sixteen and less), the chosen criterion was the age-grade distribution 
of pupils in the public schools of New York City. About 1300 children 
were then tested in “representative and average” schools of New York 
City and about 200 in nearby communities in New York and New 
Jersey. From these 1500, a number were selected (plus a small group 
of mental defectives from an institution) so as to yield an age-grade 
distribution that would fairly well approximate that of the New York 
City school population. Thus, the final standardization population of 
the younger group included 670 white subjects, ranging in age from 
seven to sixteen years, and in grade placement from “ungraded” to
twelve (plus continuation schools).


VALIDITY 
The nature of the content of the scale having been decided 
upon and norms of performance for a wide range of ages having been 
determined, the scale was subjected to several analyses intended to
determine its validity.


Intercorrelations of subtests. These coefficients are always necessary 
to provide data with regard to the presence or absence of a g factor. 
For this scale, appreciable correlations would be required, as one
aspect of its validity. The author reports the following coefficients
(each subtest correlated with every other subtest):


Range of coefficients:
.15 (PE = .034), Object Assembly with Digit Span (N = 355;
ages, 20-34)
.72 (PE = .026), Similarities with Comprehension (N = 355;
ages, 20-34)
.37 to .72 is the range of the highest three-fourths of the
intercorrelation coefficients (N = 590; ages, 20-49)


More significant are the coefficients found between scores on each
subtest when correlated with total scores of all other parts of the
scale.⁹ These coefficients were as follows:


Range of coefficients:
.41 (PE = .029), for Object Assembly (N = 355; ages, 20-34)
.73 (PE = .02), for both Similarities and Block Design (N =
590; ages, 20-49)
.60 to .72 is the range of the highest three-fourths of the
coefficients (N = 590; ages, 20-49)


9 For example, Information scores would be correlated with the total
scores of all remaining parts. If the scores on a given part (say,
Information) were correlated with total scores of the test including
the Information scores, it is obvious that the resulting coefficient
would be in part self-correlation; consequently it would be spuriously
high.
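
The point about self-correlation can be illustrated with a small sketch. The scores below are hypothetical (they are not Bellevue data), and the function names are chosen only for this illustration. Each part is correlated once with the total that includes it and once with the total of the remaining parts.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def part_whole(scores, part):
    """Return (r with full total, r with total of the remaining parts)."""
    part_scores = scores[part]
    n = len(part_scores)
    total = [sum(scores[name][i] for name in scores) for i in range(n)]
    rest = [t - p for t, p in zip(total, part_scores)]
    return pearson(part_scores, total), pearson(part_scores, rest)

# Hypothetical weighted scores for six persons on three subtests.
scores = {
    "Information":  [10, 12, 9, 14, 11, 8],
    "Similarities": [12, 9, 10, 11, 13, 8],
    "Block Design": [9, 13, 11, 10, 8, 12],
}

uncorrected, corrected = part_whole(scores, "Information")
print(f"with full total: {uncorrected:.2f}; with remaining parts: {corrected:.2f}")
```

In this made-up example the coefficient computed against the full total is much the larger of the two, since the part is being correlated in part with itself.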


Additional evidence consistent with the presence of a general factor 
is provided by the following data showing the correlations between 
total scores on various combinations of subtests: 


Ages of subjects, 20 to 34 years 
Number of subjects, 355 
.83 (PE = .018), for total verbal subtest scores with total per- 
formance subtest scores 
.90 (PE = .014), for four subtests combining verbal and per- 
formance with a similar combination of four other subtests 


The foregoing coefficients of intercorrelation cannot be said to be
spuriously high, in spite of the fact that chronological age of the sub- 
jects varied considerably; for the subjects were adults, and in adult 
groups there is no continuous increase in score with successive ages, 
as there is in the case of children who are still developing in mental 
capacity. In calculating similar correlation coefficients for groups of 
children, it would be necessary to eliminate chronological age as a 
factor. If this were not done the coefficient would appear to be higher 
than it actually should be, since both variables, being correlated, 
would reflect the influence of a third factor: namely, chronological 
age. Similar statistical information obtained with a representative 
group of adolescents should have been provided in the process of 
standardization. 
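
The elimination of chronological age as a factor is ordinarily accomplished by the first-order partial correlation. A minimal sketch follows; the three coefficients are hypothetical values for a group of children, not figures from the Bellevue standardization.

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z (here, chronological age) held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values: two sets of scores correlate .80 with each other,
# and each correlates .60 with chronological age.
r_age_constant = partial_correlation(0.80, 0.60, 0.60)
print(round(r_age_constant, 4))
```

With these assumed values the coefficient falls from .80 to about .69 once age is held constant. When neither variable is correlated with age, as the text argues is roughly the case for adults, the partial coefficient equals the original one.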


Correlation with schooling. The reader will recall that beginning with 
Binet, the amount of schooling or quality of educational achievement 
—or both—were used as criteria of validity. Hence, the ratings ob- 
tained on the Bellevue scale, for the adult standardization population, 
were correlated with number of years of schooling, the coefficient be- 
ing .64. (With mental defectives omitted, r = .53.) This coefficient of 
.64 falls within the range of coefficients commonly found when these 
two variables are correlated. 




Correlation with teachers’ judgments. Teachers’ estimates of pupils’ 
intelligence, rated on a six-point scale, were used as a validating
criterion. For a group of adolescents (74 in number) in a trade school,
the correlation coefficient between Bellevue IQ’s and teachers’ ratings
was .52; for another group (45 in number) in a general high school, 
the coefficient was .43. The number of cases in each of these instances 
was too small to be very meaningful. 


Increase and decrease of scores with increasing age. Tables of the
scale show rises of mean scores until the age of 22.5 years, and a slow
decline thereafter. The rise in mean scores to age 22.5 is quite
inconsistent with norms of other tests. The reasons for this have yet to be
definitely determined. Was the standardization adult population more 
adequate than those of other tests? Or was it a selected group? Are the 
test materials more adequate and better scaled? Is the increase in 
norms an artifact due to methods of scoring? 


Constancy of means and standard deviation of IQ’s at various ages.
The means are reasonably constant, ranging from 98.75 at the age
interval 17-19 years, to 101.25 at age 10, with about sixty percent of
the means falling between 100 and 101. The standard deviations range
from 13.2 IQ points at age 10 to 16.85 at the age interval 50-59,
while about half the standard deviations are between 14 and 15 points.
(See page 136 regarding significance of constancy of means and standard
deviations of IQ's.) These data indicate reasonably good satisfaction
of this criterion.¹⁰
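
The coefficient of variation cited in the accompanying footnote (SD divided by the mean, times 100) can be checked against the figures given in the text for age 10. The function name in this short sketch is an assumption made for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * sd / mean

# Figures given in the text for age 10: SD = 13.2 IQ points, mean = 101.25.
cv_age_10 = coefficient_of_variation(13.2, 101.25)
print(round(cv_age_10, 2))
```

The result agrees with the value of 13.04 reported for age 10.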


Range and distribution of intelligence quotients. The range of 1508
cases, ages 10 to 60, as shown in graph form, is from about 45 to 145.
The distribution curve for this group is slightly skewed in a negative
direction;¹¹ that is, the IQ’s fall within a somewhat narrower range
above 100 than below. The author of the scale holds that the skewness
does not bring into doubt the scale’s validity; for he believes the
widespread assumption of a symmetrical, bell-shaped distribution
(Gaussian) is a “mistaken belief.” Many psychologists, however, would
regard his curve of distribution as being a satisfactory approximation
to the symmetrical curve which Terman assumed as necessary in
standardizing the Stanford-Binet scales, and which Binet himself
implied as desirable in the construction of his own scales.

10 The coefficients of variation (SD/M × 100) for the IQ's range from
13.04 at age 10 to 15.67 at the age interval 30-34.

11 Wechsler interprets his curve as being “considerably skewed.”


Known groups. In one study, two groups were differentiated on the 
basis of total scores: namely, (1) a borderline group, having IQ’s 
between 66 and 79; and (2) a mentally defective group, having IQ’s
between 50 and 65. The problem was to determine whether each of
the eleven subtests contributes significantly to the differentiation of the 
two groups. Since the mean scores on each of the subtests for the two 
groups did differentiate, and since the differences between the means 
were in the directions that should be expected, it was concluded that 
each subtest did contribute to the over-all differentiation, although 
Digit Span and Object Assembly contributed relatively little. 

A second study used only the verbal subtests with naval recruits as 
subjects. The problem here was to learn whether these subtests dis- 
tinguish between (1) the mentally defective and the borderline, or 
(2) the borderline, the dull normal, and the normal. The findings 
indicated that each of the verbal subtests contributed to the
differentiations between these groups. The Digit Span subtest in this
instance, however, proved to be as effective as the others. This
criterion of validity has still not been applied to the Bellevue scale with
the same thoroughness as it has been applied to the Stanford-Binet. 


Correlation with the Stanford-Binet. In validating a new scale de- 
signed to test intelligence, it is common practice to correlate results 
obtained by means of the new device with ratings obtained on the 
Stanford-Binet. To do this is in effect to accept the Stanford-Binet as 
a reasonably valid and reliable scale, and hence as one sound criterion 
of validity against which to evaluate the new scale. 

Using seventy-five cases (ages 14-16), the author of the Bellevue
scale reports that Bellevue IQ’s and Stanford-Binet IQ’s when
correlated yielded a coefficient of .82 (PE = .026). Six correlational
studies of Bellevue and Stanford-Binet ratings, made by others, have
yielded coefficients ranging from .57 (PE = .04) to .93 (PE = .01).
Although coefficients of partial correlation, with chronological age
constant, are not given, it is doubtful if the age range in any of these
studies is such as to have raised appreciably the size of the coefficient.
In the two instances where the coefficients were below .80, the
subjects were college freshmen; hence the relatively lower coefficients
(.57 and .62) may be attributable to either of the following
conditions: the fact that the group is relatively homogeneous, with
resultant constriction of range and reduction in correlation; the fact
that the tests are not as reliable at the upper extreme, so that errors
of measurement make high correlations unlikely.

12 D. Wechsler, et al., “A Study of the Subtests of the Bellevue Intelligence
Scale in Borderline and Mental Defective Cases,” American Journal of Mental
Deficiency, Vol. 45, pp. 555-558, 1941.

13 R. J. Lewinski, “Discriminative Value of the Subtests of the Bellevue
Verbal Scale in the Examination of Naval Recruits,” Journal of General
Psychology, Vol. 31, pp. 95-99, 1944. Also H. M. MacPhee, et al., “The
Performance of Mentally Subnormal Rural Southern Negroes on the Verbal
Scale of the Bellevue Intelligence Examination,” Journal of Social
Psychology, Vol. 25, pp. 217-229, 1947.
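
The effect of a homogeneous group upon the size of a coefficient, noted above for the college freshmen, can be sketched with the usual formula for restriction of range. The sketch assumes direct selection on one of the two variables, a linear relationship, and equal scatter throughout the range; the ratio of standard deviations used here is hypothetical and is not taken from the studies cited.

```python
from math import sqrt

def restricted_r(r, u):
    """Expected correlation in a group whose SD on the selection variable
    is u times the SD of the unrestricted group (0 < u <= 1)."""
    return (r * u) / sqrt(1 - r ** 2 + (r ** 2) * (u ** 2))

# Hypothetical case: a correlation of .82 in an unselected group, observed
# again in a group having only half the unselected standard deviation.
r_homogeneous = restricted_r(0.82, 0.5)
print(round(r_homogeneous, 2))
```

Under these assumed conditions a coefficient of .82 would be expected to drop to about .58, without any change in the relationship between the two scales themselves.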

Comparative studies, using the Stanford-Binet and the Bellevue,
have not been as numerous as might have been expected in view of
the wide use of both as clinical instruments. Unfortunately, too (but
understandably, since the Bellevue’s clinical application has been
emphasized), a large percentage of the comparative studies have used
hospital patients and clinic referrals as their subjects, to the neglect of
school pupils and others who fall within categories of “normal”
behavior and adjustment.

The available data, however, do show that there is very substantial
correlation between these two scales, particularly when the Bellevue
Full Scale IQ’s and Verbal Scale IQ’s are correlated with Stanford-Binet
IQ’s. On the other hand, as would be expected, correlations
between Stanford-Binet IQ's and Bellevue Performance Scale IQ’s are
only moderate. Table 27 shows the distribution of correlation
coefficients found in a number of representative studies.¹⁴

14 Correlational studies between the Bellevue and group scales have
yielded varying results, the coefficients ranging from .39 ± .07 (PE),
for Thorndike’s CAVD test to .81 ± .04 (PE), for the Henmon-Nelson. Of
six other coefficients, two are in the .70's, two in the .60’s, and two
in the .50’s.

Since correlation coefficients indicate relative agreement of paired
scores, but not absolute differences between them, it is necessary, in
comparing the Stanford-Binet and the Bellevue Scales, to know the
extent of actual IQ differences existing between the correlated values.
On the whole it has been found that the differences are not very large.
In studies of retarded and mentally deficient persons (say, the lowest
decile group), the Bellevue yields somewhat higher IQ’s. At the upper
level of mental ability, however (say, the highest decile group), the
Stanford-Binet yields somewhat higher IQ’s.

Comparison of IQ’s is complicated by the fact that age of testees
is also a factor. Taking the population samples as a whole (rather than
only the extreme groups), the following are the general findings. (1)
Within the age range of approximately 10 to 19 years, the Stanford-Binet
IQ’s tend to be somewhat higher. (2) From age 19 to about 35,

TABLE 27 


Coefficients of Correlation between the Stanford-Binet 
and the Bellevue Scales 
(Frequencies) 


r     With Full Scale IQ   With Verbal IQ   With Performance IQ
.90   5 2
.85   2 2
.80   1 1 1
.75   2 1
.70   1
.65   1
.60   1 1
.55   1
.50   2
.35   1


the intelligence quotients tend to be about the same. (3) Above the
age of 35, the Bellevue intelligence quotients tend to be somewhat
higher.¹⁵

In the case of any given individual, therefore, comparison of
Stanford-Binet and Bellevue IQ’s must take into account both factors:
ability level and chronological age. In a given instance, a person’s age
and ability level may be such as to increase or decrease the difference
between the IQ’s obtained with the two instruments.¹⁶

15 Number (3) is just what one should expect in view of the method
employed in calculating Bellevue IQ's. The method is explained later in
this chapter.

16 A definitive answer to the question of the comparability of the IQ’s of
the two scales will have to be based upon an investigation that is
representative of the general population rather than heavily weighted
with hospital and clinical subjects, as is now the case. Also, such an
investigation must approach the problem in two ways: (1) by analyzing
IQ’s separately for each of the age-groups; and (2) by analyzing the
IQ's separately at each of the ability levels for each of the age-groups.

On the whole, the correlation coefficients at hand, and other
comparative data found between the two scales, are reasonably satisfactory
and indicate that each has much in common with the other so far as
concerns psychological functions being tested and ratings of ability
levels.


The author of the Bellevue scale, as a final test of its validity,
applies the pragmatic criterion. He states: “How do we know that our
tests are ‘good’ measures of intelligence? The only honest reply we can
make is that our experience has shown them to be so. If this seems to
be a tenuous answer we need only remind the reader that it has been
practical experience which has given (or denied) final validity to
every other intelligence test. . . . Empirical judgments, here as
elsewhere, play the role of ultimate arbiter. In any case, all evidence
for the validity of a test, whether statistical or otherwise, is
inevitably of an indirect sort and, in the end, cumulative rather than
decisive.” ¹⁷ In other words, it has been found by the author of the
scale and by others that it works with reasonable satisfaction in
clinical practice.


Prognostic efficiency. In evaluating the prognostic effectiveness of
the Bellevue or the Stanford-Binet scale, or any other, one must bear
in mind the age range for which a particular instrument is most
effective. A blanket statement about a scale’s diagnostic effectiveness
is not warranted.

Furthermore, conclusions regarding the prognostic efficiency of a
psychological scale are dependent upon the soundness of the clinical
diagnoses with which the scale’s findings are compared. Herein lies
the major problem and weakness in attempts to determine prognostic
efficiency of a psychological scale; for the determination of diagnostic
clinical [...] is difficult, often unreliable, and subject to the
diagnostician’s theoretical orientation. The major exception to this
statement is the diagnosis of mental deficiency, for the determination
of which a sound individual test of general intelligence is the most
valid single instrument, when administered and interpreted by a
qualified psychologist. For this purpose, both the Stanford-Binet and
the Bellevue scales have proved to be most valuable. It must be stated
at once, of course, that a diagnosis of mental deficiency is usually not
made upon the basis of IQ and MA alone, although in some cases
findings with a single test are so clear and unequivocal as to suffice.

17 Wechsler, op. cit., pp. 127-128.


RELIABILITY 


It is a noteworthy fact that very little research has been done 
on the reliability of the Bellevue scale. On the other hand, a great 
volume of material has been published on its clinical uses and
interpretation, due very probably to the fact that the scale was developed
and originally used primarily by a hospital staff of psychologists, and 
due also to the fact that it has been used most extensively by clinicians 
since its publication. The neglect of reliability studies, while research 
emphasis has been placed upon studies dealing with differential diag- 
nosis, intellectual deterioration, intellectual changes under treatment, 
etc., is regrettable; for the basic soundness of an instrument should be 
studied at least collaterally with its application.¹⁸

Wechsler’s manual itself reports only very meager and inadequate 
data on reliability: namely, 52 individuals retested at intervals of one 
month to one year, with the results shown in Table 28. 


TABLE 28


Retest Correlation Coefficients for the
Bellevue Scale ¹⁹


Ages     N     Rho *   P.E.
10-13    32    .94     .013
20-34    20    .94     .018


* Rank order correlation coefficient


Since publication of the foregoing data, even the few available re- 
liability studies have dealt almost exclusively with abnormal subjects, 
principally psychoneurotics and schizophrenics.²⁰ Table 29 shows the
range of correlation coefficients found with such groups for each of 
the subtests and IQ scales. 


18 The ready use of a new and promising clinical instrument is
understandable and justifiable, since clinicians, confronted by immediate, pressing and
persistent problems of living human beings, cannot wait until experimental 
research has subjected the instrument to thoroughgoing tests of reliability and 
validity. In using an instrument, however, it is essential that we give due con- 
sideration to its limitations and to unanswered or only partially answered 
questions about it. 

19 From Wechsler, op. cit., p. 133. (By permission.) 

20 Because of the instability of such persons, they are not the most suitable 
subjects to use for the study of the inherent stability of a measuring instrument. 


Examination of Table 29 shows: (1) that there is considerable
variation in reliability among the subtests and that, in general, their
reliability is appreciably below that of the scale as a whole; (2) that
in all but one of the reports (in which the subjects were schizophrenics
and r = .55) full scale reliability appears to be reasonably
satisfactory, considering the instability of the groups used. (Other r’s
were: .87, .84, .84, .87, .89, .90.)


TABLE 29


Test-Retest Reliability of the Bellevue Scale Reported
for Abnormal Groups
(7 Studies)


Subtest                 Range of Coefficients
Information             .56-.99
Comprehension           .13-.78
Digit Span              .50-.97
Arithmetic              .68-.87
Similarities            .38-.93
Vocabulary              .90-.93
Picture Arrangement     .49-.86
Picture Completion      .32-.89
Block Design            .65-.87
Object Assembly         .31-.79
Digit Symbol            .34-.91
Verbal IQ               .76-.91
Nonverbal IQ            .52-.94
Full Scale IQ           .55-.90


From published reports, it appears that only one investigation,
using a fairly adequate sampling of individuals, has been devoted to
reliability of the Bellevue when administered to “normal” subjects.²¹
The test-retest method was employed. The age range was 20 to
approximately 50 years. One group of 60 subjects was retested after a
one-week interval; another group of 60 persons after a four-week
interval; a third group of 38 subjects after a six-month period. The
major findings were the following.


The mean score for every subtest and for the total scale increased
for all three groups.


Increases in scores tend to be somewhat smaller as the retest
interval is increased.


The smallest average increase was 0.3 points in weighted score for
comprehension retest (after four weeks).


The largest average increase was 2.8 points in weighted score for
picture arrangement retest (after one week).


Largest average increases (2 or more points in weighted score)
were found for picture arrangement and object assembly.


Smallest average increases (less than one point in weighted score)
were found for information, comprehension, and similarities.


Average changes in IQ’s were: verbal scale, 4.4 points; nonverbal
IQ, 9.1 points; full-scale IQ, 7.6 points.


21 G. F. Derner, M. Aborn, and A. H. Canter, “The Reliability of the
Wechsler-Bellevue Subtests and Scales,” Journal of Consulting Psychology,
Vol. 14, 1950, pp. 172-179. See also W. B. Webb and H. DeHaan, “Wechsler-
Bellevue Split-Half Reliabilities in Normals and Schizophrenics,” ibid., Vol. 15,
pp. 68-71, 1951. The split-half method is an inappropriate method to use with
most of the Bellevue scale.


Retest correlations and standard errors of measurement ²² for all
subtests and the three IQ’s are shown in Table 30. It will be noted
that, for the subtests, four of the coefficients are in the .60’s (very low
reliability); two are in the .70’s (low reliability); five are in the .80’s
(satisfactory reliability for a subtest).


TABLE 30


Test-Retest Correlations and Standard Errors of
Measurement for the Bellevue Scale ²³


(N = 158)

Subtests                Correlations    S.E. meas.
Information             .86             .68
Comprehension           .74             1.21
Digit Span              .67             1.68
Arithmetic              .62             2.06
Similarities            .71             1.22
Vocabulary              .88             [illegible]
Picture Arrangement     .64             1.82
Picture Completion      .83             .95
Block Design            .84             1.10
Object Assembly         .69             1.31
Digit Symbol            .80             1.06
Verbal IQ               .84             3.96
Nonverbal IQ            .86             4.49
Full Scale IQ           .90             3.29


22 For interpretation of standard error of measurement, see Chapter 1, p. 17.
23 From G. F. Derner, et al., op. cit.


On the whole, it appears from this study, using subjects within the
“normal” range of personality and behavior, that the reliabilities of
the Bellevue scale are not high enough in six of the eleven subtests to
warrant their use independently. This is a significant finding, especially
as it relates to the use of the subtest profile for diagnostic purposes
(discussed in Chapter 15). The reliability coefficients for the Verbal
and Nonverbal IQ ratings are reasonably satisfactory, though not as
high as some psychologists would demand. The full scale IQ
reliability, however, is quite satisfactory. The standard errors of
measurement, it will be noted, conform rather closely with the
correlations. The general conclusion, then, is that while the subtests
individually do not show a high degree of reliability, the scale as a
whole yields reasonably reliable results, as indicated by the
correlation coefficients and the standard errors of measurement.
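
The connection between a reliability coefficient and its standard error of measurement can be sketched as follows. The sketch assumes the conventional formula, SE(meas.) = SD × √(1 − r); the source refers the reader to Chapter 1 for the interpretation and does not restate the formula here. The standard deviation is not reported in Table 30, so the value computed below is only what the tabled full-scale figures would imply under that assumption.

```python
from math import sqrt

def standard_error_of_measurement(sd, r):
    """Conventional formula: SD of the scores times sqrt(1 - reliability)."""
    return sd * sqrt(1 - r)

def implied_sd(se_meas, r):
    """Invert the same formula to recover the SD implied by a tabled pair."""
    return se_meas / sqrt(1 - r)

# Full Scale IQ in Table 30: r = .90, S.E. meas. = 3.29.
sd_full_scale = implied_sd(3.29, 0.90)
print(round(sd_full_scale, 1))
```

Because the standard error depends upon the spread of scores as well as upon the coefficient, two measures having similar retest correlations need not have similar standard errors.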

These results indicate, again, the significance and value of measur- 
ing a number of functions; for not only does such a measure yield a 
more representative index, but there is a greater probability that daily
or periodic fluctuations in performance will be compensated for 
through the testing of a number of representative functions. 


SCORING AND IQ CALCULATION 
Scoring. All parts of this scale are scored on a point basis.
For some subtests, the earned raw score is simply the number correct,
each item being scored either plus or minus (e.g., information). Or,
as in the case of comprehension and similarities, the score for each
item is 0, 1, or 2, depending upon quality of the response. In other
parts, as in arithmetical reasoning and block design, the earned raw
score is based not only upon correct responses but upon the time taken
to solve the problem. Thus, the factor of speed of performance is
involved in sections of this scale, especially in nonverbal subtests.

The raw score for each subtest is first obtained by simple addition
of the credits on the items in that part. This raw score is then
converted into a weighted score (a type of standard score), by means of a
conversion table. The purpose of this conversion is the customary one
of placing all subtest scores on a comparable basis. The weighted
scores for all parts of the scale are then added to obtain the full score
upon which the “full scale IQ” is based. Also, the weighted scores of
only the six verbal parts are added to get the verbal score, upon which
the “verbal scale IQ” is based. Similarly the weighted scores of the
five performance tests are added to get the performance score and
“performance scale IQ.”

The following well-known formula was used for converting each
subtest’s raw scores into weighted scores.


$$X_2 = M_2 + \frac{SD_2}{SD_1}\,(X_1 - M_1)$$


$M_2$ = an arbitrarily assigned mean (10)

$SD_2$ = an arbitrarily assigned standard deviation (3)

$X_2$ = the weighted score to be found

$M_1$ = the mean of the subtest’s raw score

$SD_1$ = the standard deviation of the subtest’s raw scores

$X_1$ = the particular raw score to be converted to a weighted score


What this formula does is: (1) assign an arbitrary and uniform
mean score to all subtests; (2) multiply each individual score’s
deviation from its mean by a constant ratio; (3) add the result to or
subtract it from the assigned mean.²⁴ By using this formula, scores are so
converted that each individual maintains his relative status on each
subtest. And in the case of any given person’s subtest scores,
differences between scores will be attributable, theoretically, to differences
in his performance level rather than to differences in the weighting of
each subtest in the total. It is thus possible to vary the number of items
in each of the several subtests without giving any of them too little or
too much weight in the total score upon which the IQ is based.
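The conversion just described can be checked numerically. What follows is a minimal sketch in Python, not a procedure supplied by the author of the scale; the function name `weighted_score` and its parameter names are assumptions introduced here for illustration. It uses the assigned mean of 10 and standard deviation of 3, and the two sets of values suggested in the footnote to this passage.

```python
def weighted_score(raw, raw_mean, raw_sd, assigned_mean=10.0, assigned_sd=3.0):
    """Convert a raw subtest score into a weighted (standard) score.

    Applies X2 = M2 + (SD2 / SD1) * (X1 - M1), where M2 and SD2 are the
    arbitrarily assigned mean and standard deviation, and M1 and SD1 are
    the mean and standard deviation of the subtest's raw scores.
    """
    if raw_sd <= 0:
        raise ValueError("raw_sd must be positive")
    return assigned_mean + (assigned_sd / raw_sd) * (raw - raw_mean)


# Two subtests with different raw-score scales; in both cases the person
# stands three-quarters of a standard deviation above the subtest mean.
first = weighted_score(raw=15, raw_mean=12, raw_sd=4)
second = weighted_score(raw=30, raw_mean=24, raw_sd=8)
print(first, second)  # 12.25 12.25
```

Both subtests yield the same weighted score because the individual's relative status is the same on each, even though the raw scores differ.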

On the Bellevue scale, the reason for converting raw scores into
weighted scores is this: the possible maximum raw scores vary in the
several subtests of the scale. If, therefore, the raw scores were simply
added to obtain an individual's rating on the scale, each of the parts
would carry a different weight in the total; each part would have the
possibility of contributing differently to the final result, some more
heavily than others. The raw-score units of one part of the scale would
not have the same significance as those of other parts. If this were
the case, then implicit in the scoring would be the assumption that
certain of the psychological functions being tested should be regarded
as more important than others in the total score and in getting an
index of intelligence. The Bellevue scale, however, is scored on the
principle that all the functions tested are equally important; hence,
the part-scores should be equally weighted so that each part may
contribute as much to the total as any other part.

24 The best way for the student to see how this formula works is to substitute
several sets of values in it and to observe the outcome. The logic of the process
will then be more readily apparent. For example, if the two following sets of
values are substituted in the formula, the process will be clear. Assume the
following data for one subtest: mean score = 12, SD = 4, X = 15; while for a
second subtest the corresponding values are 24, 8, and 30. The results will
show the same weighted score for both subtests (12.25) because the
individual’s relative status on each subtest was identical with that on the other
subtest, even though the raw scores differ.


Calculating the IQ. After the verbal score, performance score, and
full score have been obtained (in converted units), the three
corresponding intelligence quotients are found in tables prepared for that
purpose. In calculating IQ’s for the Bellevue scale, the basic principle
used differs from that in the Stanford-Binet and in most other scales.
The principle employed is that an individual's intelligence quotient
should be determined by the relative extent to which his weighted
score (full, verbal, or nonverbal) deviates from the mean weighted
score of his own age group. The detailed method of actually
determining Bellevue intelligence quotients involves several steps and
assumptions.

First, the mean weighted score and the standard deviation for each
age level are calculated. Second, the weighted scores at each age level
are converted into standard scores (Z scores. See Chapter 2, page
45). Then, third, it is assumed that a value of .6745 of a standard
score shall be equated with an IQ of 90.²⁵ Hence, this assumption
means that the probable error (.6745 SD) of Bellevue IQ’s is
arbitrarily set at 10 points; and the standard deviation of IQ’s at
approximately 15 points, since the PE equals .6745 of the SD.²⁶ In other
words, according to this device, in a “normal,” symmetrical
distribution of Bellevue IQ’s, fifty percent of the IQ values will fall between 90
and 110 (±1 PE), since the PE sets the limits of the middle fifty
percent of the scores of a distribution; and about two-thirds (68.26
percent) of the values will fall between 85 and 115 (±1 SD).

25 The reason why .6745 is taken is this: the index called “probable error”
(PE) is .6745 of the standard deviation. The standard score, it will be recalled,
is an index given in terms of the standard deviation. The PE, therefore, is
convenient and easily calculated when the standard scores are known.

26 This device makes Bellevue IQ's approximate the most probable SD of
the Stanford-Binet distribution of intelligence quotients: namely, about 16
points.

On the basis of the foregoing assumption, it is possible to assign an 
IQ value to any weighted standard score. In effect, what the procedure 
amounts to is that the standard score is converted into a probable 
error value; this value is multiplied by 10; and the result is added to 
or subtracted from 100, depending upon whether the standard score 
is plus or minus. The calculations are shortened and facilitated by a 
formula for the purpose.²⁷ The intelligence quotients found in this
manner should be called “deviation IQ’s.”
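The arithmetic of this procedure may be sketched as follows. This is an illustrative Python sketch under the assumptions stated in the text (one probable error equals .6745 SD and is set equal to 10 IQ points); it is not the shortened formula referred to above, and the names `deviation_iq`, `PE_PER_SD`, and the other identifiers are introduced here only for illustration. The normal-curve proportions are computed with `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

PE_PER_SD = 0.6745        # one probable error, expressed as a fraction of one SD
IQ_MEAN = 100.0
IQ_POINTS_PER_PE = 10.0   # the arbitrary assumption: 1 PE = 10 IQ points


def deviation_iq(z):
    """Deviation IQ for a standard score z taken within the subject's age group."""
    pe_units = z / PE_PER_SD              # standard score re-expressed in PE units
    return IQ_MEAN + IQ_POINTS_PER_PE * pe_units


# One PE below and above the age-group mean give IQ's of 90 and 110.
iq_low, iq_high = deviation_iq(-PE_PER_SD), deviation_iq(PE_PER_SD)

# IQ points corresponding to one standard deviation (roughly 15).
iq_points_per_sd = deviation_iq(1.0) - IQ_MEAN

# Proportion of a normal distribution lying within 1 PE and within 1 SD of the mean.
normal = NormalDist()
share_within_pe = normal.cdf(PE_PER_SD) - normal.cdf(-PE_PER_SD)
share_within_sd = normal.cdf(1.0) - normal.cdf(-1.0)

print(iq_low, iq_high)
print(round(iq_points_per_sd, 2), round(share_within_pe, 3), round(share_within_sd, 4))
```

The sketch shows why the standard deviation of these IQ's is only approximately 15 points: dividing 10 by .6745 gives a value slightly under 15.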


Criticism of the Bellevue Method. When this method of determining
intelligence quotients is used, the following points should be noted.
This method consistently compares (by means of IQ’s) an individual’s
test score with the mean for his own chronological age group up
to 60 years, whereas on the Stanford-Binet and other scales a person
whose CA is greater than the age at which “average adult MA” is
reached is compared with a given and fixed maximum age level. Thus
it will be recalled that on the 1916 Stanford-Binet, in calculating the
IQ of a person 16 years of age or older, the maximum value in the
denominator is 16 (MA/16) regardless of his actual age. When using
the 1937 Stanford-Binet, the maximum denominator is 15. In other
words, the method used with the 1937 Stanford-Binet relates the test
performance of a person above age 15 to the performance level of an
average group at a specified maximum age (15); the Bellevue method,
at all ages up to sixty, relates a person’s test performance to the
average performance of his own age group. Also, then, the Bellevue
method, for the purpose of IQ calculation, obviates the necessity of
determining age of “average adult MA,” always a difficult problem
and as yet uncertain.

One weakness of the Bellevue method of calculating IQ is this: the
same, or constant, objective performance on the test (i.e., the score)
will rate an individual higher with increased age after the maximum,
since his constant score will be compared with steadily declining age
norms. For example, a person who earns an IQ of 60 at age 20,
maintaining a constant score to age 50, would have an IQ of 78 at age 50.
To maintain an IQ of 60 there would need to be an average rate of
decline consistent with an IQ of 60.

27 See Wechsler, op. cit., pp. 219-220.
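The effect can be illustrated with a small Python sketch. The age norms below are hypothetical values invented for this illustration; they are not taken from the Bellevue tables, and the resulting IQ's are therefore not the 60 and 78 cited in the example. The name `deviation_iq_for_age` is likewise an assumption made here. The only point shown is that an unchanged score, compared with a lower age-group mean, yields a higher deviation IQ.

```python
# Hypothetical (mean, standard deviation) of weighted full-scale scores by age.
# These figures are invented for illustration and are NOT the published norms.
HYPOTHETICAL_NORMS = {
    20: (100.0, 20.0),
    50: (85.0, 20.0),   # a lower mean stands in for the declining age norm
}

PE_PER_SD = 0.6745


def deviation_iq_for_age(score, age):
    """Deviation IQ of a weighted score relative to the (hypothetical) age norm."""
    mean, sd = HYPOTHETICAL_NORMS[age]
    z = (score - mean) / sd
    return 100.0 + 10.0 * (z / PE_PER_SD)


constant_score = 60.0
iq_at_20 = deviation_iq_for_age(constant_score, 20)
iq_at_50 = deviation_iq_for_age(constant_score, 50)

# The same score earns a higher IQ at 50 than at 20 under these assumed norms.
print(round(iq_at_20, 1), round(iq_at_50, 1))
```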

Another point concerns the question of when an individual's mental
level, as measured by the tests, begins to decline. This, too, presents a
question as yet not definitively answered. There is also the question of
the rate of decline in functions being tested during each of the several
stages of adulthood. The Bellevue IQ’s would not provide any
evidence toward answers to either of these questions. But regardless of
specific answers to these two questions, data show that with the
Bellevue scale there is some slight decline in average test scores between
the ages of approximately 25 and 45, and that rate of decline in
average test scores increases thereafter.²⁸ Obviously, therefore, after the
period of decline (however moderate) sets in, as shown by the tests,
an individual’s IQ will begin gradually to decline if the Stanford-Binet
method is used; whereas if the Bellevue method of calculating IQ is
used, an individual’s rating will decline only if he loses ground with
reference to the average of his own age group, rather than with
reference to a more or less hypothetical “average adult level.” If his losses
are at the same general rate as those of his age group, his Bellevue IQ
will remain relatively constant. If, however, his losses are less than
the general rate, his Bellevue IQ will rise.


SPECIAL FEATURES OF THE BELLEVUE SCALE 


The Bellevue scale provides a scheme for calculating a
“deterioration quotient” based on the premise that certain types of tested
mental processes decline more rapidly than do other types, and that
the difference between rates of decline, as between these two types,
in the case of any given person indicates his relative degree of
deterioration. In other words, there are certain tested functions that
hold up with age and others that do not hold up with age. This index
will be dealt with in more detail in a later chapter, together with other
tests devised to measure deterioration of mental abilities.

28 The reader should note that we have emphasized decline in “test score.”
This does not necessarily mean that on the whole a person becomes
progressively “less intelligent” even before the effects of senescence become apparent.
While it is true that there is some loss in average test scores after about age 25,
it is also true that some mental traits, as yet unmeasured by intelligence tests,
increase in effectiveness through an extended period of adulthood and more
than compensate for losses in the processes measured by current scales. This
view is borne out by the facts regarding ages of maximum achievement of
scholars, scientists, writers, and artists.

A second feature of the Bellevue scale is its emphasis upon “scat- 
ter analysis”—that is, analysis of an individual's performance on the 
several parts of the scale for the purpose of facilitating clinical analy- 
sis of the subject’s performance. Such analysis may lead to diagnostic 
inferences concerning personality characteristics and behavior dis- 
orders due to organic brain disease, psychosis, psychoneurosis, ado- 
lescent psychopathy, and mental deficiency. Here again, this applica- 
tion of the scale and clinical evidence, supporting and contrary, will 
be presented in a subsequent chapter on clinical uses and interpreta- 
tions of tests. 


CRITICISMS AND EVALUATIONS 


The Bellevue scale is being quite widely used in the measurement
of adult intelligence, especially in psychological clinics. Implicit
in the wide use of a testing instrument is endorsement and acceptance
of the scale, at least as one of the more satisfactory of those available
at the time.


Was the population sample adequate? Since the 1751 persons upon 
whom this scale was standardized were from New York City and 
nearby communities, the adequacy of the population sample may be 
and has been seriously questioned. It would be highly desirable for 
the author of this scale, and his associates, to assemble and publish a 
frequency distribution of scores and intelligence quotients obtained 
with “normal” groups in various sections of the country. These data 
should be separately presented for each of the several age levels; they 
should be analyzed for socio-economic differences and sex differences, 
to determine if these are significant. 

The occupational distribution of the standardization population 
was a moderate approximation to the 1930 census which was taken 
as the basis for the selection of the population sample. Furthermore, 
the validity of occupational regrouping in the process of
standardization may be questioned.²⁹ Also, in the standardization population,
there were more persons from the upper educational levels than there 
were in the general population. 


29 See Wechsler, op. cit., p. 111.

In view of the foregoing considerations, the norms of the subtests 
and the totals should not be regarded as final. 


Are the subtests a variety of disconnected types of tests? With the
exceptions of object assembly and, to a lesser extent, picture
arrangement, the intercorrelations of the subtests are all significant and
marked. This would suggest that the Bellevue subtests are measuring
one or more common factors to an appreciable extent. The
intercorrelations between verbal and nonverbal subtests are not, on the
whole, as high as intracorrelations within each of these categories;
but with the exceptions noted the intercorrelations are indicative of a
common factor or common factors.

This scale has not been adequately analyzed factorially. One
analysis, however, finds that a “first factor” (general factor) accounts for
from 27 percent to 50 percent of individual differences in scores, the
weight of the general factor varying apparently at different age-levels.³⁰
The high correlations between the Stanford-Binet and the Bellevue
scales indicate that the two instruments have much in common as
regards the psychological functions measured by them. And, it will be
recalled, the Stanford-Binet has been found to measure primarily a
general factor. The question of factors as determined from analysis
of Bellevue scale results themselves, is, however, one that should be
subjected to further comprehensive investigation.³¹

Are the verbal subtests culturally unfair to some persons? The
answer to this question is the same as the one given for the
Stanford-Binet. In addition, it may be said that in this scale, the verbal
materials in Comprehension, Similarities, and Arithmetic are stated in
such terms as place very little premium upon educational or other
cultural advantages. Like all tests of information and vocabulary,
performance on these in the Bellevue is dependent, in part, upon
opportunity to learn, whether in school, home, or through the intellectual
exploitation of all aspects of one’s environment.

30 [illegible], The Third Mental Measurements Yearbook, O. K. Buros, editor,
New Brunswick: Rutgers University Press, 1949, p. 393. See also B. Balinsky,
“An Analysis of the Mental Factors of Various Age Groups from Nine to Sixty,”
Genetic Psychology Monographs, Vol. 23, pp. 191-234, 1941.

31 The need for more research is still present even though a few factorial
studies have been published recently. These studies are, unfortunately, limited
to [illegible]. See J. Cohen, “A Factor-Analytically Based Rationale for
the Wechsler-Bellevue,” Journal of Consulting Psychology, Vol. 16, pp. 272-
277, 1952. This report deals with psychoneurotics, schizophrenics, and
brain-damaged patients. The findings, therefore, are not as universally applicable
as the title implies. Also: J. E. Birren, “A Factorial Analysis of the Wechsler-
Bellevue Scale Given to an Elderly Population,” ibid., Vol. 16, pp. 399-405,
1952. Since the subjects of this report were between 60 and 74 years of age, the
findings can hardly be regarded as representative.


Are some of the test items obsolete? Items in any and all tests must 
be reviewed periodically for possible obsolescence. In the case of the 
Bellevue, a number of the items, especially in comprehension, in- 
formation, vocabulary, picture arrangement, and picture completion 
should be re-examined and re-evaluated. Also for a number of items 
the satisfactory, partially satisfactory, and unsatisfactory responses 
should be reviewed and revised in the light of responses that have been 
obtained since the scale’s original publication. 


Is the factor of speed important in the Bellevue scale? Unlike the 
Stanford-Binet, in which very few test items are timed, Bellevue scores 
are significantly affected by the speed factor. Speed of performance 
yields additional credits in the following subtests: arithmetic, picture 
arrangement, object assembly, digit symbol, and block design. Thus, 
in the total score, speed of work is combined with power (or ability 
level). Although in general speed and power are highly correlated, 
it is also a fact that response time slows down with age. Thus, since 
the Bellevue scale is designed for adults, to an important degree it 
measures, especially in later adult years, decline in speed of response 
and not necessarily decline in power. This factor must be kept in 
mind when, in a later chapter, we consider the “decline” of abilities 
and the suggested “deterioration” index. 


Do the nonverbal subtests involve visual acuity? Although no
experimental data are available in answer to this question, not a few
users of the Bellevue scale have observed that visual acuity might be
a factor in some instances. The subtests most likely to make some
demands upon visual acuity are picture arrangement and picture
completion. And, of course, color blindness must be eliminated as a factor
in the block design subtest.


Are the reliability coefficients satisfactory? Available data indicate 
that the total verbal scores and the total performance scores have a 
fairly satisfactory degree of reliability; while the full scale scores have 
a degree high enough to satisfy the standards of most psychologists. 
The reliabilities of the subtests, however, are not high enough, when
taken individually, to be used uncritically for differential diagnosis or
for evaluating deterioration of mental ability, since accuracy in both 
of these matters depends upon accuracy of retest results. More re- 
search is necessary on the several reliabilities for a normal popula- 
tion rather than for hospital and clinical cases. 


Should the Bellevue scale use the IQ? There is warranted criticism 
of the use of the term “intelligence quotient” for the index derived 
with this scale, because the formula employed really changes the con- 
ception of the IQ and thereby confuses the meaning of the term. 
Meanings of scientific terms are established by priority and usage; 
and usage had established a conception of the IQ as developed by the 
Binet revisions and by group tests which followed the same basic 
principle as regards the maximum denominator in the formula (IQ = 
MA/CA). Since the IQ has been and is being used with the Bellevue
scale, it is necessary to bear in mind that it is a “deviation IQ.”


Should the verbal and the performance scores be combined? The
tables of norms for this test show that the maximum verbal score norm
is reached at the age of 22.5 years, while the maximum performance
score norm is attained at 16.5-18 years. Maximum full scale norm
is found at the age of 22.5 years. In view of these differences, we may
question the wisdom of combining verbal and performance scores,
particularly after the age of 18, when one group of functions
(performance) no longer develops differentially, while the other group of
functions (verbal) does so develop. It is quite possible that these two
sets of tests and functions are sufficiently different after the age of 18
so that they should be separately scored. It is also possible that the
value of the scale is reduced by combining the two sets of subtests.
Furthermore, it is possible that some of the inadequacies of the
Bellevue scale findings may be attributable to combining the two types of
subtests.


Is the Bellevue scale clinically useful? Judging from its widespread
use in clinics and hospitals and from the empirical judgments of many
clinicians, it appears that this scale has been of considerable value.
Used with scientific judgment and with knowledge of its limitations
and its tentativeness in some aspects, the Bellevue scale can be a
very useful instrument for estimating intelligence of adolescents and
adults. An important qualification must be added: namely, that this
scale is not adequate to measure and differentiate the highest levels
of ability; at least the upper five percent of the population.

The Bellevue scale is a valuable addition to other testing and diag- 
nostic devices, such as the Stanford-Binet, the Arthur Performance 
Scale, the Babcock test, and others which will be presented. Between 
most of these scales there are significant correlations; the exclusive 
clinical effectiveness of one or the other has yet to be established. 
For the present, and no doubt in the future, clinicians (in schools, 
hospitals, and elsewhere) will use a given scale or a combination of 
scales as occasion demands and as their clinical insights suggest. 

Some psychologists give considerable weight to the fact that the 
Bellevue is so constructed that it is possible to analyze an individual’s 
scores in terms of his variations (consistency or inconsistency) on the 
several parts of the scale, especially since attempts have been made 
by the author of the scale and by others to specify the psychological 
functions being tested by each of the several parts. 

The validity of the Bellevue in identifying personality and be- 
havior disorders has yet to be unequivocally demonstrated. In spite 
of the fact that this is the area in which most of the evaluative studies 
of the scale have been made, clinical findings are by no means de- 
finitive. 

In attempting to diagnose personality and behavior disorders on 
the basis of the pattern or profile of scores on the Bellevue or other 
scales, it must be remembered also that different educational back- 
grounds and cultural factors, quite unrelated to personality and be- 
havior disorders, could account, to some degree, for an individual’s 
inconsistency of performance on the several parts of the scale. It has 
been found, too, that individual variations in interests, as distinguished 
from personality disorders, find expression in different patterns of 
mental activities and are reflected in subtest variations. However, the 
results obtained by means of the Bellevue, plus clinical experience 
and acumen, provide a valuable combination for study of individual 
differences and individual mental functioning. 


General comment. Since its appearance in 1939, the Bellevue scale
has been widely reviewed and evaluated.³² Judgments have varied
from enthusiastically uncritical acceptance to destructively critical
rejection. Some critics have concentrated, and justifiably, upon
unwarranted assumptions and statistical inadequacies, to the exclusion of
other and positive aspects. At the other extreme are the critics who
have ignored the scale’s defects and limitations while concentrating
upon and lauding its practical value. Most evaluations, however, have
been moderate in that they have pointed out the contributions, values,
and possibilities of the scale, while clearly indicating its defects and
doubtful assumptions. Experimental research and competent practical
application can proceed simultaneously, each facilitating the other.

32 See O. K. Buros, ed., op. cit., pp. 386-398; and Buros, The Fourth Mental
Measurements Yearbook, Highland Park, N. J.: The Gryphon Press, 1953, pp.
473-476.


THE 1955 REVISION OF THE BELLEVUE SCALE 


Early in 1955 a revised edition of this scale is scheduled for
publication. This new edition, it appears, will meet some of the adverse
criticisms directed against the original scale. The revised version (to
be known as the Wechsler Adult Intelligence Scale) does not
introduce any new principles in its content, construction, organization,
scoring, or IQ derivation. The main changes are in the revision of
some content, extension of the population sample, and in
improvement in directions for administering and scoring. Some of the major
revisions (supplied by The Psychological Corporation prior to
publication of the scale) are the following.

Content. Range of difficulty has been extended, chiefly downward
in order to assure a score for the lower level of mentally deficient
subjects. Upward extension in difficulty has been slight. Progression
of difficulty from item to item has been improved. Obsolete items
have been replaced. Items having poor “item validity” and those
overlapping others in content have been replaced, as have those that were
ambiguous. Illustrations in the picture completion subtest have been
more clearly drawn. The vocabulary subtest has been revised so as to
produce a fairly normal distribution of scores for a representative
sample of the population. Maximum scores in verbal and nonverbal
subtests and in the full scale are reached by the 25-34 year age-group.

33 Other questions asked in connection with the Stanford-Binet
might also be asked about the Bellevue. But since the principles
involved and the replies would be the same, they have not been repeated.

Form II of the Bellevue scale has been made available (New York: The
Psychological Corporation, 1946). This form will not be discussed because:
(1) it is, presumably, identical with Form I in respect to underlying principles
and types of test materials; (2) the standardization information provided is too
limited, and the data provided indicate that Form II does not meet the criteria
necessary if it is to be used as an alternate scale: that is, equal or very nearly
equal means, deviations, and distributions obtained with the same population
sample.


Population sample. Norms are based upon a sample of 1700 per- 
sons, 850 of each sex, selected from four major geographic areas. The 
subjects ranged in age from 16 to 64 years. The age range was di- 
vided into seven age-groups, within each of which the numbers were 
proportioned according to the 1950 U. S. census with respect to geo- 
graphic area, race (white and nonwhite), occupation, urban-rural, and 
years of formal education. Supplementary data were also obtained for 
a sample of older persons (N = 352) above 65 years of age. 


Reliability and Validity. For the separate subtests, the reliability 
estimates range from .66 (picture arrangement) to .96 (vocabulary). 
Reliabilities for the three IQ’s are: verbal scale, .96; performance 
scale, .93; full scale, .97. Intercorrelations among the eleven subtests 
range from .30 to .85, while the correlation coefficient for total verbal 
scores vs. total performance scores (ages 18-24) is .77. 


THE WECHSLER INTELLIGENCE SCALE FOR CHILDREN
(1949)³⁴


Description. This scale for children from five through fifteen 
years of age is built on the same principles and in the same form as 
the Bellevue scale for adolescents and adults: verbal subtests, per- 
formance subtests, a verbal IQ, a performance IQ, and a full scale IQ. 

The subtest types are identical with those of the older scale, with
the following exceptions: digit span is made optional; an optional
maze test has been added; and in place of digit-symbol, a coding test
has been substituted, in which various lines in varied positions (single,
double, circle) are associated with geometric figures (star, circle,
triangle, cross, rectangle).


Standardization Population. The scale was standardized on a sample
of 100 boys and 100 girls at each of the eleven age levels, each child
being tested within one and one-half months of his mid-year. In
effect this means that children were selected at the half-way mark
between birthdays; and half-way was defined as being between four and
a half months and seven and a half months (excepting the
feeble-minded, nearly all of whom were within two months of their
mid-year).

Selection of the 2200 children was based upon: (1) rural-urban
residence; (2) father’s occupation; and (3) geographic area. The
proportions in these sampling factors were based upon U. S. Census
data for 1940, “. . . with some adjustment for the shift of
population toward the West.” In the final selection of the standardization
sample, geographic area percentages are reasonably well satisfied;
urban-rural percentages, less well; and father’s occupation
percentages, moderately.³⁵ On the whole, the standardization group satisfies
the principles of sampling better than did the population sample used
for the Bellevue adult scale. Still, 2200 children distributed among
eleven age groups and over four very wide geographic areas (New
England and Middle Atlantic States; North Central States; South
Atlantic and South Central States; Mountain and Pacific States) are
but a small handful. Extensive experimental use of this scale will be
necessary to determine the adequacy of the norms based upon the
population sample used in standardization.

34 New York, The Psychological Corporation.


Reliability Data.³⁶ Reliability coefficients were found for three age
groups (7½, 10½, 13½), the number being 200 in each. The
findings are summarized in Tables 31 and 32. It will be noted, from these
data, that the subtest reliability coefficients vary markedly and are,
on the whole, only moderate in magnitude. The IQ reliabilities,
however, being from .86 to .96, fall within the range that is generally
acceptable. These data demonstrate again the necessity of
distinguishing between reliability of part of a scale and that of the whole scale.

35 For detailed standardization, see Wechsler, op. cit.; and H. Seashore, et
al., “The Standardization of the Wechsler Intelligence Scale for Children,”
Journal of Clinical Psychology, Vol. 14, pp. 99-110, 1950.

36 The split-half technique was used to calculate reliability, except in the case
of coding and digit span. For the former, results of coding tests A and B were
correlated, since this is a speed test. For digit span, scores on digits
forward were correlated with scores on digits backward, a very questionable
procedure since the two do not involve identical processes. Also, some question
may be raised regarding the appropriateness of the split-half method for some
of the other subtests. The test-retest method is much to be preferred for a
scale of this type.
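For readers unfamiliar with the split-half technique mentioned above, a minimal Python sketch follows. It correlates odd-item and even-item half scores and then applies the Spearman-Brown formula to estimate full-length reliability; that this customary correction was applied to the data reported here is an assumption, since the source does not say so. The item scores are invented for illustration, and all function names are introduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimated reliability of the full test from the half-test correlation."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one list of item credits per person tested."""
    odd_half = [sum(person[0::2]) for person in item_scores]
    even_half = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson_r(odd_half, even_half))


# Invented item credits (1 = pass, 0 = fail) for five persons on six items.
scores = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
]
reliability = split_half_reliability(scores)
print(round(reliability, 2))
```

The sketch also suggests why the method is doubtful for speed tests: when nearly every attempted item is passed, the two halves agree almost automatically and the coefficient is inflated.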




The standard error of measurement indicates the range of score
within which the chances are approximately two to one that a
subject’s “true” score will fall in that particular subtest. Thus, the
standard error of 1.20 for 7½-year-olds on picture arrangement indicates
that the probabilities are two to one that an individual’s “true” score
on this subtest is within 1.20 points of his obtained weighted score.
Likewise the standard error of 4.25 IQ points (full scale) for
7½-year-olds indicates that the probabilities are about two to one that an
individual’s “true” IQ on this scale is within 4.25 points of his
obtained IQ.

TABLE 31
Reliability Data: Intelligence Scale for Children³⁷

Subtest Reliabilities
Age Group   Range of r's   Mean   High r         Low r
7½          .59-.84        .67    Block Design   Comprehension and
                                                 Picture Completion
10½         .59-.91        .76    Vocabulary     Digit Span
13½         .50-.90        .75    Vocabulary     Digit Span

IQ Reliabilities
Age Group   Verbal   Nonverbal   Full
7½          .88      .86         .92
10½         .96      .89         .95
13½         .96      .90         .94
(Digit Span, Coding, and Mazes are not included.)
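The standard errors cited here are related to the reliability coefficients by the usual formula, SE = SD × √(1 − r). The source does not state this formula, so the following Python sketch is offered only as a check under two assumptions: that IQ's on this scale have a standard deviation of 15 and weighted subtest scores a standard deviation of 3, as on the Bellevue scale. The function name is introduced here for illustration.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


ASSUMED_IQ_SD = 15.0       # assumption: carried over from the Bellevue scale
ASSUMED_SUBTEST_SD = 3.0   # assumption: weighted scores have an SD of 3

# Full scale IQ at age 7 1/2, reliability .92 (Table 31).
full_scale_se = standard_error_of_measurement(ASSUMED_IQ_SD, 0.92)

# Verbal IQ at age 10 1/2, reliability .96 (Table 31).
verbal_se = standard_error_of_measurement(ASSUMED_IQ_SD, 0.96)

# A subtest with reliability .84, the highest subtest value at age 7 1/2.
subtest_se = standard_error_of_measurement(ASSUMED_SUBTEST_SD, 0.84)

print(round(full_scale_se, 2), round(verbal_se, 2), round(subtest_se, 2))
```

Under these assumptions the computed values agree closely with the standard errors reported for this scale (about 4.24 against the reported 4.25 for the full scale at 7½). The "two to one" chances in the text correspond to the roughly 68 percent of a normal distribution lying within one standard error of the obtained score.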

Conclusions on Reliability. The reliability coefficients and the 
standard errors of measurement must be taken into account when 
scores on the individual subtests are being interpreted or when dif- 
ferences in scores between subtests are being evaluated. The lower 
the reliability and the larger the standard error, the less is the con- 
fidence to be placed in judgments based upon scores of that particular 
subtest. 

Since, on the basis of the standardization data, the reliabilities of 
the several IQ’s are at a satisfactory level, it appears that considerably 
more confidence can be placed in those indexes than in the scores of 
the individual subtests (with the exception of vocabulary).


37 From the Manual, p. 13. The Psychological Corporation. (By permission.)




Since there are marked differences between reliability coefficients
of the subtests for each of the three age groups reported in the
standardization data, it is highly desirable that separate reliability
studies be made for each of the eleven age groups, especially at
the extremes of the age distribution (5 to 15) for which the scale is
intended.


TABLE 32
Standard Errors of Measurement: Intelligence Scale for Children ³⁸

                          Subtest
Age Group   Range *      Mean   High         Low
7½          1.20-2.45    1.74   Digit Span   Picture Arrangement
10½         .90-1.92     1.44   Digit Span   Vocabulary
13½         .95-2.12     1.47   Digit Span   Vocabulary

                    IQ Standard Errors **
Age Group   Verbal   Nonverbal   Full
7½          5.19     5.61        4.25
10½         3.00     4.98        3.36
13½         3.00     4.74        3.68

* Standard errors of measurement of subtests are given in units of the
weighted scores.
** Standard errors of intelligence quotients are given, of course, in IQ
points.

Validity. Subtest Intercorrelations. In the manual for this scale,
there are no data on the problem of validity as such. There are data
on intercorrelations of the subtests. The assumption is that significant
intercorrelations between subtests would validate the hypothesis that
they and the scale as a whole measure common factors. However, the
intercorrelation coefficients among the individual subtests are, on
the whole, not as high as would be expected. At the 7½-year level,
these coefficients are concentrated within the .20’s and .30’s; at the
10½-year level, they are concentrated within the .30’s and .40’s;
while at the 13½-year level, they are distributed within the .20’s,
.30’s, and .40’s.

On the other hand, each verbal subtest correlates quite significantly
with total verbal score, the range for the three age groups being from
.44 to .82, with the coefficients fairly evenly distributed over this
range. The nonverbal subtests correlate somewhat lower with total
performance scores, the range being from .32 to .68, with some
concentration in the .50’s.


38 From the Manual, p. 13. The Psychological Corporation. (By permission.)


The correlation coefficients between total verbal scores and total 
performance scores are, respectively, .60, .68, and .56 for these same 
age groups. 


TABLE 33
Correlations between the Intelligence Scale for Children and Other Scales
(5 Studies)

Other Scale          Subjects             Number   r
Arthur Point Scale   mentally defective   40       .79 (Full Scale)
    "                    "                40       .83 (Nonverbal Scale)
    "                    "                40       .47 (Verbal Scale)
Stanford-Binet, L        "                40       .76 (Full Scale)
    "                    "                40       .64 (Nonverbal Scale)
    "                    "                40       .75 (Verbal Scale)
Stanford-Binet       subnormals           70       .68 (Full Scale)
    "                    "                70       .69 (Verbal Scale)
Stanford-Binet       normals              49-53    .85 (Full Scale)
    "                    "                49-53    .82 (Verbal Scale)
    "                    "                49-53    .80 (Nonverbal Scale)
Arthur Point Scale       "                49-53    .80 (Full Scale)
    "                    "                49-53    [illegible] (Verbal Scale)
    "                    "                49-53    .81 (Nonverbal Scale)
Stanford-Binet, L        "                54       .80 (Full Scale)
    "                    "                54       .71 (Verbal Scale)
    "                    "                54       .63 (Nonverbal Scale)
Stanford-Binet           "                332      .82 (Full Scale)
    "                    "                332      .74 (Verbal Scale)
    "                    "                332      .64 (Nonverbal Scale)


These findings indicate that, on the whole, while each subtest has
only a very moderate amount of communality with the others taken
singly, verbal subtests combined have much more communality with
each individual verbal subtest.³⁹ The same is true of combined
performance and separate performance scores.


39 Corrections are made in order to eliminate self-correlation.


TABLE 34
IQ's of Intelligence Scale for Children Compared with Two Other Scales
Means and Standard Deviations
(5 Studies)

WISC                       Arthur Point Scale   S-B             Subjects     N
 60 (S.D. 6)  Full
 65 (S.D. 13) Verbal       65 (S.D. 12)          56 (S.D. 5)    Deficients   40
 58 (S.D. 10) Perform.

 56 (S.D. 9)  Full
 67 (S.D. 7)  Verbal                             65 (S.D. 7)    Subnormal    70
 72 (S.D. 11) Perform.

100 (S.D. 15) Full
 99 (S.D. 14) Verbal       95 (S.D. 16)         105 (S.D. 15)   Normal       49-53
101 (S.D. 15) Perform.

102 (S.D. 11) Full
101 (S.D. 12) Verbal                            106 (S.D. 11)   Normal       54
104 (S.D. 11) Perform.

101 (S.D. 13) Full
103 (S.D. 14) Verbal                            108 (S.D. 16)   Normal       332
 98 (S.D. 15) Perform.


Finally, the data indicate that all the verbal subtests taken as a
whole have considerable communality with all the performance
subtests as a whole. Yet, since the aforementioned coefficients of .60,
.68, and .56 are fairly distant from unity, the measured abilities in
one group (verbal) can be used only for a general approximation of
abilities measured by the other group of subtests (nonverbal), and
vice versa. The reporting, therefore, of verbal, nonverbal, and full
scale IQs with this instrument is a desirable, in fact a necessary,
practice.
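
How far coefficients of .60, .68, and .56 fall short of unity can be shown numerically. The following is a sketch only: it uses the standard identities that r² is the proportion of shared variance and that the standard error of estimate is SD × √(1 − r²), and it assumes an IQ standard deviation of 15. The function names are hypothetical.

```python
import math


def shared_variance(r):
    """Proportion of variance two measures have in common (r squared)."""
    return r * r


def standard_error_of_estimate(sd, r):
    """Typical error when predicting one score from the other:
    SD * sqrt(1 - r^2). The SD of 15 used below is an assumption."""
    return sd * math.sqrt(1.0 - r * r)


# Verbal-performance correlations reported for the three age groups.
for r in (0.60, 0.68, 0.56):
    print(r, round(shared_variance(r), 2),
          round(standard_error_of_estimate(15.0, r), 1))
```

Under these assumptions, even the highest of the three coefficients leaves more than half of the variance unshared, which is consistent with the recommendation to report the three IQs separately.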

Correlations with Other Scales. Since the appearance of this scale,
several reports have been published that deal with the correlations
and IQ differences found between it, the S-B and the Arthur. The
summarized data are given in Tables 33 and 34. Unfortunately,
the findings of these studies must, with one exception, be regarded as
only suggestive and quite tentative; for the number of cases in each
is very small, or the coefficients have been affected by the age range
of the testees.

The exception reports on 332 cases between the ages of 5 and 15.
The data given in Tables 33 and 34 are for the entire group. At the
different ages, the r’s between Stanford-Binet and full scale IQ’s vary
from .75 to .90; for the verbal scale, between .65 and .90; for the
performance scale, between .50 and .75. The table giving mean
intelligence quotients and standard deviations indicates that the
Wechsler scale tends to rate subnormal subjects somewhat but not markedly
higher than does the Stanford-Binet. At the average level the reverse
is true. The differences between the means are fairly marked in the
study of 332 individuals. This being the most comprehensive and
detailed report of those herein reported, its findings carry the greatest
weight.⁴⁰


TABLE 35
Correlations of Intelligence Scale for Children with School Achievement

                        Range of r's for
Scale       N       Separate Subjects   Total Achievement Test
Full        54      .45-.71             .76
Verbal      54      .48-.60             .62
Nonverbal   54      .41-.64             .65
Full        18-21   .44-.81             -
Verbal      18-21   .47-.74             -
Nonverbal   18-21   .29-.74             -

On the basis of the research thus far reported, it is reasonable to
conclude that full scale intelligence quotients and verbal scale
intelligence quotients, on the one hand, and Stanford-Binet IQ’s, on the
other, have considerable communality of psychological functions
being measured. The performance scale intelligence quotients have much
less in common with the Stanford-Binet.

Predictive Efficiency. Validity of a test for children should also be
evaluated in terms of its predictive efficiency with respect to
educational achievement. In this area, too, few data are available for this
children’s intelligence test. Table 35 summarizes the results reported
in two studies.

Correlations of IQ with teachers’ ratings of their pupils’ intelligence 
were .68 (full), .64 (verbal), and .53 (performance). 


40 J. I. Krugman, “Pupil Functioning on the Stanford-Binet and the Wechsler
Intelligence Scale for Children,” Journal of Consulting Psychology, Vol. 15, pp. 
475-483, 1951. 




Conclusions on Validity. On the whole, these results are encourag- 
ing; the correlation coefficients fall within the approximate range of 
indexes usually found for other widely used tests of intelligence, in- 
cluding the Stanford-Binet. These and similar data are, however, not 
yet definitive; the number of cases is small in both studies; and in 
one of the studies, grade level and age range were not sufficiently 
controlled. Further research is necessary at each of the age and grade 
levels for which the scale is intended before generalizations may be 
offered with assurance regarding the scale’s predictive efficiency in
respect to educational achievement.
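
Why a small number of cases leaves a coefficient tentative can be illustrated with an approximate confidence interval. The sketch below uses the Fisher z transformation, which is a standard method but is not one the studies themselves are said to have used; it assumes random sampling from a bivariate normal population, and the function name and the 1.96 multiplier are illustrative choices.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation coefficient
    via the Fisher z transformation (standard error 1 / sqrt(n - 3))."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))


# r = .76 with total achievement in a study of 54 cases.
low_54, high_54 = fisher_ci(0.76, 54)
print(round(low_54, 2), round(high_54, 2))

# For comparison, r = .82 with the Stanford-Binet in the study of 332 cases.
low_332, high_332 = fisher_ci(0.82, 332)
print(round(low_332, 2), round(high_332, 2))
```

Under these assumptions the interval for 54 cases is roughly three times as wide as the interval for 332 cases, which is one way of stating why the larger study carries more weight.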


Evaluation and Criticisms. On the whole, this intelligence scale for
children is a useful addition to the very limited number of
instruments now available for individual testing. It is to be expected that
the deficiencies regarding standardization population and studies of
reliability and validity will be remedied as the scale continues to be
used for both practical and research purposes. Some psychologists,
at this stage, question the wisdom of using this Wechsler scale as a
substitute for the Stanford-Binet until more definitive data are
obtained for the former.

Although one of the advantages originally claimed for this scale
was that it did not use the mental-age concept, it has since been found
desirable to supply mental-age equivalents.⁴¹ The mental-age concept
is an extremely useful one when interpreted by qualified psychologists.
This concept should be made an integral part of the scale.

The relatively low reliabilities of the subtests indicate that there is
no merit in merely deriving a test profile for purposes of diagnosis
and guidance. The reliabilities of part-scores must be high before
profiles can be used with confidence. The total verbal, performance,
and full scores, however, have yielded reliability coefficients at a
satisfactorily high level of confidence.

Considerably more research remains to be done on the predictive
efficiency (validity) of this scale. Data available thus far show that
IQ differences between it and the Stanford-Binet are significant
enough to warrant caution, in spite of the generally high correlations
found between them; and, it must be added, some of the coefficients
between the two scales are only moderate.⁴² The Stanford-Binet IQ’s
tend to be higher within the “normal” range, especially at the earlier
age levels. Since the Stanford-Binet has been in use much longer, and
has been found to have considerable value in schools and clinics, and
has been widely used as a validating criterion, it is probable that the
obtained discrepancies between the two scales will stimulate research on
and improvement of the more recent instrument.


41 D. Wechsler, “Equivalent Test and Mental Ages for the WISC,” Journal
of Consulting Psychology, Vol. 15, pp. 381-384, 1951.


The limits of the IQ values given by the Wechsler full scale are 
from 46 to 154. This means that the scale cannot be used with in- 
dividuals who rank above or below these limits. In terms of total 
number of persons, the percentages of such cases will be very small; 
but in particular instances this can be a serious limitation.


42 J. J. Pastovic and G. M. Guthrie, “Some Evidence on the Validity of
WISC,” Journal of Consulting Psychology, Vol. 15, pp. 385-386, 1951. A par- 
ticularly significant finding in this report is that, on the Wechsler scale, the 
mean performance IQ’s are higher than the mean verbal IQ's. If these results 
are corroborated, they will raise some doubt over the applicability of the scale 
to the usual problems of predicting and evaluating educational achievement. 


8. 




INDIVIDUAL PERFORMANCE SCALES 


ABOUT the time the final revision of Binet’s own scale appeared,
some psychologists in the United States had assembled a group of
performance tests intended to meet practical problems in the study
of human abilities and behavior. One of these was the Healy-Fernald
group of tests, devised primarily to examine juvenile delinquents in
an effort to determine their intellectual levels and personality traits
revealed in the course of the examination.¹ Unlike tests which were
subsequently developed and which are now in use, those of Healy
and Fernald were not actually standardized in respect to
administration and scoring. This group of tests provided the psychological
examiner with situations wherein he could observe, evaluate, and interpret
the testee’s methods of solving problems and his behavior in test
situations. The tests were selected on the basis of Healy’s and
Fernald’s judgment and psychological insights as to what constitutes
intelligent activity; beyond this the value of the results obtained with
their test would depend upon the clinical acumen of examiners, since
there were no norms based upon standardization procedures. While
the Healy-Fernald tests are infrequently used today, they are
important in the historical development of performance scales.

Several of the older and of the current scales will be described;
these descriptions will be followed by a discussion of their uses and
by a general evaluation of this type of instrument.


1 W. Healy and G. M. Fernald, Tests for Practical Mental Classification,
Psychological Monographs, Vol. 13, No. 2, 1911.




THE PINTNER-PATERSON SCALE OF PERFORMANCE TESTS


Contents. This group of performance tests, the first to be
organized into a scale, is now of interest principally for its historical
and background value.² It will also enable the student to see that
many of the earliest types of performance tests have survived the
years of experimentation and application and have been incorporated
into scales now current, including the Bellevue.

Pintner and Paterson standardized some of the Healy-Fernald per- 
formance tests as well as several which had been devised by other 
psychologists and themselves. The final scale includes fifteen tests 
which can be presented without the use of language; nor do they 
require the use of language on the part of the subject. They are in- 
tended primarily for use with persons having serious hearing defects 
and for non-English-speaking individuals. These and similar perform- 
ance tests have been found valuable as supplements to verbal tests 
of mental ability, and also with subjects who, though they are
English-speaking, have speech defects or reading disabilities.

The subtests in the scale are described below. 


(1) Mare and Foal Form Board. This, of the picture-puzzle 
type, is a pictureboard of a mare and foal, in color. Sections of the 
board are removed to begin with; the subject must replace them cor- 
rectly. Score is based on time required and number of wrong moves. 

(2) Seguin Form Board. This is a form board in which ten com- 
mon geometric shapes are to be placed. Score is based on the shortest 
time required in three trials. 

(3) Five-Figure Board. There are five geometric figures, each of 
which is divided into two or three parts. The pieces are to be fitted 
into their appropriate places. Score is based on time required and 
number of errors made. 

(4) Two-Figure Board. There are two geometric figures, one cut 
into four sections and the other into five. These are to be correctly 
placed in two spaces. Score is based on time required and number of 
moves. 

(5) Casuist Board. This form board—more difficult than the 
preceding ones—consists of four spaces in which twelve sections 
have to be fitted. Score is based on time required and number of 
errors made. 


2 R. Pintner and D. Paterson, A Point Scale of Performance Tests, New
York: D. Appleton, 1917. 




(6) Triangle Test. Four triangular pieces are to be fitted into the 
board. Score is based on time required and number of errors made. 

(7) Diagonal Test. Five variously shaped sections have to be 
fitted into a rectangular form. Score is based on time required and 
number of errors made. 

(8) Healy Puzzle A. This consists of five rectangular sections 
which are to be fitted into a rectangular frame. Score is based on 
time required and number of moves made. 

(9) Manikin Test. Wooden legs, arms, head, and body are to be 
put together to make the form of a man. Score depends on quality of 
performance. 

(10) Feature Profile Test. Wooden sections have to be put
together to form the profile of a man’s head. Score is based on time
required.

(11) Ship Test (originated by H. A. Knox). This is a picture of
a ship cut into ten sections, all of the same size and shape, to be
inserted properly in a rectangular frame. Score depends on quality of
performance.

(12) Healy Picture-Completion Test I. This is a large picture
from which ten small squares have been cut out. The missing parts
are to be selected from among forty-eight squares identical in size.
Score depends on the quality of completion within a limit of ten
minutes.

(13) Substitution Test. A page of rows of geometric figures (five
different shapes) which have to be marked with appropriate digits,
to correspond with a key at top of page. Score is a combination of
time and errors made.

(14) Adaptation Board. This is a form board having four circular
blocks and holes; three are 6.8 cm. in diameter, while the fourth is
7 cm. The subject is shown that one block fits the larger hole. He is
then required to keep his attention fixed and to fit this larger block
into the correct space when the board is moved into four different
positions. Score is based on the number of correct moves.

(15) Cube Test. Four cubes (one inch) are placed before the
subject. With a fifth cube they are tapped in a specified order by the
examiner. The subject is asked to imitate the order of tapping. The
sequence becomes longer and more complex. Score is the number of
sequences correctly imitated.

For general testing purposes, the authors of this performance scale
recommend the use of a short scale which includes ten of the fifteen
parts: namely, 1, 2, 3, 4, 5, 9, 10, 11, 12, and 15 of the foregoing
list.

The age range of the Pintner-Paterson scale is from four years to
fifteen. However, this does not mean that every test in the series has
discriminative value throughout this range. For example, the Seguin
Form Board does not have value in general beyond age ten, while the
Feature Profile test is not generally useful below age ten.

Scoring. Three different methods of scoring were provided by the 
authors: median mental age, point score, and percentile rank. For 
each test there is a separate table of mental age norms; the median 
of an individual’s MA’s on each of the several subtests is taken as 
the single MA to represent his general performance on the entire 
scale. In the point scale, the subject earns a total score in points for 
all the parts; the total score determines his MA, as indicated in a table 
of norms. By the percentile method, the subject’s score on each of the 
several tests yields a percentile score; these can be combined to yield 
a single percentile rating for the whole scale. Of the three indexes, 
the median mental age has been used most widely with this scale. 
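
The median mental age method described above amounts to a very small computation. The sketch below is illustrative only: the subtest mental ages are invented, they are expressed in months for convenience, and the function names are hypothetical rather than taken from the Pintner-Paterson manual.

```python
from statistics import median


def median_mental_age(subtest_mental_ages_months):
    """Single MA for the scale: the median of the subtest MAs (in months)."""
    return median(subtest_mental_ages_months)


def as_years_months(months):
    """Format a whole number of months as years and months."""
    years, remainder = divmod(int(months), 12)
    return f"{years} years {remainder} months"


# Hypothetical mental ages, in months, read from five subtest norm tables.
hypothetical_mas = [96, 102, 90, 108, 99]
overall = median_mental_age(hypothetical_mas)
print(as_years_months(overall))
```

One reason a median suits this purpose is that a single very high or very low subtest rating, such as a chance success, moves it less than it would move an average.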


Evaluation. During the many years that the Pintner-Paterson
performance tests were used for clinical and experimental purposes, the
following evaluations of them were widely accepted. They are more
susceptible to practice effects, and chance successes are more frequent,
than is the case with verbal tests; hence, the reliability coefficients of
these performance tests are not as high as those of the verbal. This
scale is useful primarily with young children, and with older children
and adults who are mentally retarded or deficient. The scale has
clinical significance, also, in the case of an older child when there is
marked discrepancy in performance on the several subtests. The
several parts of the performance scale examine processes that are more
specific than those examined by verbal tests. This is indicated by the
considerable scatter of ratings on the several parts and by the lower
correlation coefficients found when ratings on each separate part were
correlated with ratings for the whole scale, the range being from very
negligible coefficients to fairly high, with a median at about .50.
Performance tests of the Pintner-Paterson kind correlate poorly or
only very moderately with intelligence tests of the verbal kind as
represented by the Stanford-Binet, when the group being studied is
limited with respect to age or range of ability. For instance,³ when
a group of gifted children was examined with both scales, the
correlation coefficient between the two sets of ratings was only .23. For a
group of dull children who were likewise examined, the coefficient
was .43. Assuming adequate reliability of both scales, these
coefficients indicate that there is only little or moderate correspondence
between them, regarding functions being tested. Furthermore, the
very low coefficient of .23 suggests, also, that the performance scale
is particularly inadequate for differentiating among performance levels
of gifted children.


3 D. A. MacMurray, “A Comparison of Gifted Children and of Dull-Normal
Children Measured by the Pintner-Paterson Scale as Against the
Stanford-Binet Scale,” Journal of Psychology, Vol. 4, 1937, pp. 273-280.


FIG. 8.1. Pintner-Paterson Performance Tests. C. H. Stoelting Company.
(By permission.)

A fairly large number of studies have been published reporting
higher correlations between performance-test ratings, obtained with
the Pintner-Paterson and similar tests, and those obtained with
revisions of the Binet scale. Coefficients of the order of .70 and .80
were not uncommon, whereas others were as low as .50. These
coefficients, however, cannot be interpreted as necessarily indicating
that there is a considerable community of function between these
performance tests, on the one hand, and verbal tests of ability on the
other. Indeed, it appears that to a considerable degree the correlation
coefficients are due to the wide age-range of the subjects tested, with
the result that the coefficients reflect the fact that the psychological
functions being tested by both types increase with age; that is, the
results on both types of tests are to an appreciable extent the product
of age. An ordinary group of ten-year-old children will get higher
scores on both types of tests than will a similar group of
nine-year-olds, who, in turn, will score higher on both than an ordinary group
of eight-year-olds; and so on. This is to be expected; for the tests
have been so constructed as to yield progressive increases in
age-norms as chronological age increases.

Another example of the effect of age-range on correlations is found 
in the coefficients between intelligence ratings and height or weight 
or dentition. These, for a wide age-range, are in the neighborhood of 
.50 and .60, because in general, older children are taller, heavier, and 
have more permanent teeth. It would not be said, of course, that these 
coefficients indicate community of function between the psychological 
tests and the physical measures. But within a single age-group the 
correlation coefficients between these physical traits and intelligence 
test ratings drop down to negligible levels. Thus, when age is held 
constant, or very nearly so, the correlation coefficients between re- 
sults obtained with the Pintner-Paterson and similar performance 
tests, on the one hand, and verbal tests, on the other, drop to between 
.40 and .60.
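
The effect of a wide age-range on a correlation coefficient can be illustrated with a small simulation. Everything in the sketch below is invented for the purpose: scores on two hypothetical tests rise one point per year of age, the two tests share a modest common factor within each age (the weight of 0.6 is arbitrary), and the ages, group sizes, and random seed are arbitrary as well. It is not a model of any of the studies cited.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(ages=range(5, 16), per_age=300, seed=0):
    """Return (pooled r over all ages, mean r within single ages)."""
    rng = random.Random(seed)
    pooled_x, pooled_y, within = [], [], []
    for age in ages:
        xs, ys = [], []
        for _ in range(per_age):
            common = rng.gauss(0, 1)  # factor shared by both tests
            xs.append(age + 0.6 * common + rng.gauss(0, 1))  # "performance"
            ys.append(age + 0.6 * common + rng.gauss(0, 1))  # "verbal"
        within.append(pearson_r(xs, ys))
        pooled_x.extend(xs)
        pooled_y.extend(ys)
    return pearson_r(pooled_x, pooled_y), sum(within) / len(within)


pooled, within_age = simulate()
print(round(pooled, 2), round(within_age, 2))
```

In this artificial case the pooled coefficient is high almost entirely because both scores increase with age, while the coefficient within a single age group reflects only the modest common factor.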

A factorial analysis of results obtained on thirty-four commonly
used performance tests suggests one reason why there is only a low or
moderate correlation between these and verbal tests of mental
ability. This analysis appears to indicate that the principal factors
measured by the performance tests may be identified as “spatial, perceptual
speed, and induction.”⁴ While the first two of these functions are
involved to some extent in many verbal tests of ability, they are
actually of relatively little significance there in the determination of
an individual’s rating.

These findings signify that verbal scales and those of the
Pintner-Paterson type may not be used interchangeably but should be used
to supplement each other.


4 C. M. Morris, “A Critical Analysis of Certain Performance Tests,”
Pedagogical Seminary and Journal of Genetic Psychology, Vol. 54, 1939,
pp. 85-105. “Perceptual speed” is the readiness to discover and identify
perceptual detail (mainly visual). The “spatial” factor involves the
ability to manipulate objects in space.


THE CORNELL-COXE PERFORMANCE ABILITY SCALE ⁵


Contents. For this scale, the particular tests included were 
selected from a variety of sources. The authors carried out their own 
standardization and revised the directions for administering and scor- 
ing. 

The tests included are the following: 

(1) Manikin and Profile. These are already familiar to the reader. 
They are scored for accuracy and time required. 

(2) Block Designs. These are the familiar Kohs colored-block
designs, five of which were included. They are scored for accuracy and
time required.

(3) Picture Arrangement. This includes ten series of pictures
which, though different in subject matter, are the same in principle
as those in the Bellevue scale. They are scored for accuracy only.

(4) Digit-Symbol. This type test is also familiar to the reader; for
it is of the same kind as that included in the Bellevue and other
scales. It is scored for accuracy and time required.

(5) Memory for Designs. This test includes five cards, on each of
which is a geometric design. The subject is asked to reproduce each
design after it has been shown for ten seconds. This type of test is
similar to that used by Binet. The score depends upon quality of
reproduction.

(6) Cube Construction. This test utilizes blocks some sides of
which are painted while others are not. The examiner presents
models of cube construction and asks the subject to duplicate them.
The score depends upon both accuracy and time.

(7) Picture Completion. (This is an optional substitute for test 3.)
The Healy Picture-Completion Test II was selected. The score
depends upon accuracy only. (This test is the same in principle as
Picture-Completion I; but its theme is different and on a higher level
of difficulty.)

It will be noted that the Cornell-Coxe scale differs from other
performance scales in that it does not include any form boards.

Scoring. In order that each test in the scale might contribute equally
to the total score, the authors follow the common practice of
converting raw scores into a type of standard unit.⁶ Total score is
obtained by adding the weighted scores for the several parts. A mental
age is then obtained from a table of norms, extending from an MA
of 4 years and 6 months to 16 years and 8 months.


5 E. L. Cornell and W. W. Coxe, A Performance Ability Scale, Yonkers,
N. Y.: World Book, 1934.
6 Ibid., pp. 29 ff.
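
The idea of converting raw scores to a standard unit so that each test contributes equally to the total, as described under Scoring above, can be sketched as follows. The passage does not say which standard unit Cornell and Coxe used, so the z-score conversion here is an assumption; the raw scores, test names, and function names are likewise invented for illustration.

```python
from statistics import mean, pstdev


def to_standard_units(raw_scores):
    """Convert one test's raw scores to z-scores (mean 0, SD 1).
    The z-score is an assumed stand-in for the authors' standard unit."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    return [(x - m) / s for x in raw_scores]


def total_scores(raw_by_test):
    """Sum each subject's standardized scores across all tests."""
    standardized = [to_standard_units(scores) for scores in raw_by_test.values()]
    return [sum(per_subject) for per_subject in zip(*standardized)]


# Hypothetical raw scores for four subjects on two tests scored on
# very different numerical scales.
raw = {
    "block_designs": [10, 20, 30, 40],
    "digit_symbol": [100, 300, 200, 400],
}
totals = total_scores(raw)
print([round(t, 2) for t in totals])
```

Without the conversion, the test with the larger raw-score spread would dominate the total simply because of its scale.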


Validity and Reliability. It appears from experimental data, cited 
by the authors, that they used the following criteria of validity: a 
“satisfactory” distribution of scores by school grades (increasing 
averages in successive grades); a distribution of scores that conforms 
well to the symmetrical bell-shaped curve; a high correlation between 
scores on each test in the scale with total scores. Although the cor- 
relation between total performance scores and chronological ages is 
.78, Cornell and Coxe do not regard CA as a validating criterion of 
first importance. 

Reliability coefficients of the several parts of the scale varied from 
.66 to .89, while for total scores the reliability coefficient was .929 
(125 cases). These data indicate a satisfactory degree of reliability 
for the scale as a whole. 

The authors of this scale maintain that a performance scale should 
not be a substitute for those of the Binet type and other verbal tests, 
but should supplement them. They sought, therefore, to devise an 
instrument which should differ from these others in respect to func- 
tions tested. In this, they apparently succeeded fairly well; for al- 
though the correlation coefficient found between total performance 
score and Stanford-Binet (1916) mental ages was .79 for a wide age- 
range, when chronological age was held constant the partial correla- 
tion coefficient was reduced to .38. This means that in a group of 
about the same age the results of the two scales would indicate
relatively low community of function.⁷
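
Holding chronological age constant, as in the partial correlation just cited, follows a standard formula. The sketch below uses the reported values of .79 (performance score with Stanford-Binet mental age) and .78 (performance score with chronological age); the third value needed, the correlation of Stanford-Binet mental age with chronological age, is not given in this section, so the .85 used here is purely hypothetical.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


r_perf_binet = 0.79  # reported: performance score with Stanford-Binet MA
r_perf_age = 0.78    # reported: performance score with chronological age
r_binet_age = 0.85   # hypothetical: not reported in this section

partial = partial_r(r_perf_binet, r_perf_age, r_binet_age)
print(round(partial, 2))
```

With this assumed third value the partial coefficient comes out near the reported .38, which shows how a simple correlation of .79 can shrink sharply once the common dependence on age is removed.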

It appears also that the several tests within the scale have only 
very moderate community of function. Intercorrelations of the parts 
ranged from coefficients of .50 to .75, over the entire age-range; but 
when chronological age was held constant, the partial correlation 
coefficient varied from about .20 to .60. 

In constructing their scale, Cornell and Coxe were interested
primarily in developing a supplementary instrument. They say in this
regard: “One important value of any scale supplementary to the
Binet scale lies in the fact that if the two scales used give different
results, the psychologist’s attention is directed toward discovering
reasons for whatever differences may be found, and his analysis and
interpretations are thereby enriched and tend to have greater
validity.”⁸ In view of the data on the degree of correspondence between
results obtained with this performance scale, on the one hand, and
those obtained with Binet revisions and verbal group tests, on the
other, it is reasonable to conclude that the Cornell-Coxe scale serves
the purpose for which it is intended.


7 Practically the same results were obtained when the Cornell-Coxe scale
was correlated with the National Intelligence Test, a group test of the verbal
type. The simple correlation coefficient was found to be .74.


THE ARTHUR POINT SCALE OF PERFORMANCE TESTS ⁹


Contents. Form I of this scale is a restandardization of some
of the tests used in the Pintner-Paterson, plus two other tests. The
eight parts are: Knox Cube Test, Seguin Form Board, Two-Figure
Form Board, Casuist Form Board, Manikin, Feature Profile, Mare
and Foal, Healy Picture Completion I. The two additional are:
Porteus Maze Test and Kohs Block Design Test.


FIG. 8.2. Porteus Maze Tests, Years 5 and 14. C. H. Stoelting Company.
(By permission.)


8 Ibid., p. 37.
9 G. Arthur, A Point Scale of Performance Tests, New York: The
Commonwealth Fund, Vol. 1, 1930; Vol. 2, 1933; Vol. 1 appeared, also, in a
revised edition in 1943.

The Porteus test consists of a series of mazes of increasing diffi- 
culty, each printed on a separate sheet. The subject is required to 
trace, with pencil, the course from entrance to exit. The Kohs test 
consists of the same set of blocks used in the Bellevue scale, but dif- 
ferent designs are to be reproduced. 

The purpose of the restandardization of the scale is to provide a 
more reliable and useful performance scale for clinicians. Form I is 
based upon results obtained with 1100 public-school children of mid- 
dle-class families. The usual validating criteria were applied, such as 
parental occupation, age-grade distribution, significant increases in 
score in successive ages, and degree of correspondence with ratings 
obtained by means of other scales already considered to be acceptably 
valid (Stanford-Binet and Kuhlmann-Binet). 


Scoring. An individual’s score is variously determined on the several 
tests by number of successes, or time required, or degree of accuracy, 
or a combination of these. Each test yields a raw score which is con- 
verted into weighted score points. The raw score for each subtest is 
assigned a value proportional to the effectiveness of the subtest in 
differentiating between successive age-levels. The total of these 
weighted scores is converted into a mental age.¹⁰ The range of mental-age
norms, in Form I, is from five and a half years to fifteen and a


10 The formula assumes that the value of a test, hence its weight in the total
score, depends upon the extent to which it differentiates between successive age
groups. The greater the difference between age-group averages of a test and
the less the amount of overlapping of scores of two adjacent age-groups, the
greater is the weight given to that test. The formula is:

$$DV = \frac{M_2 - M_1}{PE_2 + PE_1}$$

in which $M_2$ is the mean of the older of the two adjacent age groups; $M_1$ is the
mean of the younger age group; the PE’s are the probable errors (middle 50
percent of the scores) of the two groups. DV stands for “discriminative value.”
Inspection of this formula shows that as the difference between means increases,
the numerator grows larger and discriminative value increases. But the DV is
also dependent upon the sizes of the probable errors, which indicate the amount
of overlapping of scores between the age groups. Thus, also, as the probable
errors are smaller and, therefore, the overlapping less, the denominator is the
smaller; and the fraction, and hence the DV, is larger. See Arthur, op. cit., Vol. 1,
1943, p. 39.
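
The operation of this formula may be illustrated by a brief computation. The sketch below is an illustration only, not Arthur's own procedure: the scores are hypothetical, and the probable error of each group is taken as 0.6745 times its standard deviation, which assumes an approximately normal distribution of scores.

```python
from statistics import mean, stdev


def probable_error(scores):
    # Assumption: PE of a distribution = 0.6745 x SD (normal-curve convention).
    return 0.6745 * stdev(scores)


def discriminative_value(older, younger):
    # DV = (mean of older group - mean of younger group) / (sum of the two PE's)
    return (mean(older) - mean(younger)) / (
        probable_error(older) + probable_error(younger)
    )


# Hypothetical raw scores for adjacent age groups.
younger = [10, 12, 14, 16, 18]
older_small_gap = [12, 14, 16, 18, 20]  # means differ by 2 points
older_large_gap = [16, 18, 20, 22, 24]  # means differ by 6 points
younger_tight = [12, 13, 14, 15, 16]    # same mean as `younger`, less spread
older_tight = [14, 15, 16, 17, 18]      # means differ by 2 points, less spread

dv_small_gap = discriminative_value(older_small_gap, younger)
dv_large_gap = discriminative_value(older_large_gap, younger)
dv_tight = discriminative_value(older_tight, younger_tight)

print(round(dv_small_gap, 2), round(dv_large_gap, 2), round(dv_tight, 2))
```

In these hypothetical data the DV rises either when the difference between means is larger or when the spread within each group is smaller, which is the behavior the footnote describes.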




half. Arthur employs a statistical device to extend the norms down- 
ward by six months by using a constant monthly rate of decrease in 
score. This is a procedure of very doubtful validity because it assumes 
that rate of psychological development—as represented by these tests 
—is constant in the early years, whereas the prevailing conception 
among psychologists is that rate of development is most rapid in the 
earliest years and decreases as the child grows older.¹¹ The scale is
similarly extended at the upper end by adding a constant value, in 
points, to provide hypothetical mental-age ratings beyond the norms 
derived by the actual standardization process. When mental ages 
and intelligence quotients are obtained from norms thus extrapolated, 
the examiner must clearly realize that he is deriving indexes which do 
not necessarily have the same meaning as those found by actual standardization.
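
The objection may be made concrete with a small numerical sketch. The norm values below are hypothetical, and the way the constant monthly rate is obtained (from the two lowest standardized ages) is an assumption, since the source does not give Arthur's computation.

```python
def extend_norms_downward(norms, months=6):
    """Extend a {mental_age_in_months: points} table below its lowest age
    by subtracting a constant number of points per month (assumed procedure)."""
    ages = sorted(norms)
    lowest, next_lowest = ages[0], ages[1]
    # Constant monthly rate taken from the lowest standardized interval.
    monthly_rate = (norms[next_lowest] - norms[lowest]) / (next_lowest - lowest)
    extended = dict(norms)
    for step in range(1, months + 1):
        extended[lowest - step] = norms[lowest] - monthly_rate * step
    return extended


# Hypothetical norms: point totals at mental ages of 66, 72, and 78 months.
hypothetical_norms = {66: 3.0, 72: 6.0, 78: 8.0}
extended = extend_norms_downward(hypothetical_norms)

for age in sorted(extended):
    print(age, extended[age])
```

Every extrapolated month differs from its neighbor by the same number of points. That is precisely the assumption of a constant rate of development which the text calls into question.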


REVISED ARTHUR SCALE: FORM II 
The purpose of the second form is to serve as an alternate 
when retesting, and for use with preschool children. 

This version of the scale utilizes four of the test types already 
described, namely, Knox Cube, Seguin Form Board, Porteus Mazes, 
and Healy Pictorial Completion II. The only new type of material not 
thus far described is the Arthur Stencil Design. This test employs 
twenty designs, increasingly complex and more difficult to reproduce, 
which are presented singly. The testee is given six square colored 
cards and twelve colored stencils which are cut within square cards. 
Each design is to be reproduced by placing the appropriate cards and
stencils, one upon another, so as to duplicate the original in both
form and color. For example, a practice design requires merely that
a red octagonal stencil be laid over a white card to get the desired
result.


11 Dr. Arthur herself recognizes this.
12 The reader should contrast this with the method employed in constructing
the Bellevue scale, remembering, however, that the Bellevue is designed primarily
as an adult scale, while the Arthur scale is not. Compare also with method of
deriving mental ages and intelligence quotients by means of the Stanford-Binet
scale at levels of superior adults.


Reliability. Scores of Forms I and II were correlated at each age
level from six to sixteen years. The coefficients, with CA constant,
ranged from .55 (PE = .06) at age 8, to .70 (PE = .05) at ages 10




and 15. The median coefficient was .61 (PE = .06). As estimates of 
reliability, these coefficients are relatively low. The small number of 
cases in each age group, varying from 41 to 54, might account in 
part for these results. 
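
The bearing of sample size upon these coefficients can be shown with the classical formula for the probable error of a correlation coefficient, PE = 0.6745(1 - r²)/√N. The source does not state which formula Arthur used, so its application here is an assumption; the group of 400 is hypothetical.

```python
from math import sqrt


def probable_error_of_r(r, n):
    # Classical probable error of a Pearson coefficient r based on n cases.
    return 0.6745 * (1 - r ** 2) / sqrt(n)


MEDIAN_R = 0.61  # median coefficient reported for Forms I and II

pe_n41 = probable_error_of_r(MEDIAN_R, 41)    # smallest age group reported
pe_n54 = probable_error_of_r(MEDIAN_R, 54)    # largest age group reported
pe_n400 = probable_error_of_r(MEDIAN_R, 400)  # hypothetical larger group

print(round(pe_n41, 3), round(pe_n54, 3), round(pe_n400, 3))
```

With groups of 41 to 54 cases, the probable error of the median coefficient comes to roughly .06 to .07, in line with the values reported; a group of 400 would reduce it to about .02.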

That the scale is more reliable than the foregoing data suggest is 
indicated by the results of another study in which the subjects were 
61 institutionalized mentally deficient boys whose mean IQ on the 
Stanford-Binet was 67. They were tested with the Arthur scale, then 
retested after an interval of two years, the correlation between the 
two sets of scores being .85. The coefficients for each of the parts 
varied from .69 to .80. The over-all results indicate satisfactory sta- 
bility of relative rank, especially in view of the narrow range of ability 
within the group of subjects tested. It is to be noted, however, that 
the mean gain in the Arthur scale IQ was ten points, as contrasted 
with a mean loss of only one IQ point on the Stanford-Binet during 
the same interval.¹³
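
A high retest correlation and a sizable mean gain are not inconsistent, since a correlation coefficient reflects relative rank and is unaffected by adding a constant to every score. The sketch below uses hypothetical IQ's, not Patterson's data, and defines its own Pearson function so that nothing beyond the standard library is assumed.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    # Product-moment correlation computed from deviations about the means.
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical IQ's for seven subjects on first test and on retest.
first_test = [60, 62, 65, 67, 70, 72, 75]
retest = [71, 71, 76, 76, 81, 81, 85]

r_with_gain = pearson_r(first_test, retest)
mean_gain = mean(retest) - mean(first_test)

# Remove the mean gain from every retest score and correlate again.
retest_without_gain = [score - mean_gain for score in retest]
r_without_gain = pearson_r(first_test, retest_without_gain)

print(round(mean_gain, 1), round(r_with_gain, 2), round(r_without_gain, 2))
```

The two coefficients are identical, although the group has gained ten points on the average. Stability of relative rank and stability of absolute level are thus separate questions.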

The mean gain of ten points may be attributed to one or both of 
two factors: (1) scores on the performance type of test are more 
susceptible to practice (learning) than are those on the verbal type; 
or (2) residence and training in a soundly conceived and operated 
institution encourages the development and utilization of the poten- 
tialities of the mentally deficient beyond levels attained under ordinary 
circumstances. The latter factor is not synonymous with specific prac- 
tice and learning effect. It is, rather, a result of general training in 
more effective behavior; and, in some instances, also the removal of 
“blocks” that impair one’s performance. 


Validity. This performance scale was devised primarily as a clinical 
instrument to be used as a substitute for the Binet revisions in cases 
where a verbal type of scale is inappropriate, as in instances of 
language handicap, defects of vision or hearing, and inequality in 
development of an individual’s verbal and nonverbal functions. Arthur,
in her standardization procedures, has taken the position that
the basic capacities demanded by the Binet tests and by her per- 
formance tests should be essentially the same; thus, the Kuhlmann 
and Stanford revisions of the Binet are two principal validating criteria


13 R. M. Patterson, “The Significance of Practice Effect upon Re-administration
of the Grace Arthur Performance Scale to High Grade Mentally Deficient
Children,” American Journal of Mental Deficiency, Vol. 50, 1946, pp. 393-401.




used in constructing her scale. Accordingly, the main differences be- 
tween the Arthur and the Binet revisions should be in the types of 
materials used to sample the psychological functions. 

The extent to which the Arthur and Binet scales actually corre- 
spond may be inferred from a comparison of IQ ratings obtained by 
subjects examined with both, and from the correlation coefficients 
obtained between the two sets of ratings, taken separately for each 
age group. 

TABLE 36
Correlations of Stanford-Binet IQ’s and Arthur IQ’s

Age    N       r ± PE
 5     35      .70 ± .06
 6     54      .77 ± .04
 7     50      .68 ± .05
 8     44      .74 ± .05
 9     41      .80 ± .04
10     40      .51 ± .08
11     44      .68 ± .05
12     3[?]    .80 ± .04
13     27      .21 ± .12
14     27      [illegible]
15     16     -.10 ± .17


In the first place, Arthur reports that the probable error (PE) of
intelligence quotients was 4.97 points when the performance scale
ratings were compared with those of the Kuhlmann-Binet, and 4.92
points when compared with the Stanford-Binet. What this means is
that in fifty percent of the cases, the IQ differences were five points
or less, while in the remaining fifty percent the differences were
greater. The frequencies of the differences, however, decline markedly
as the size of the differences increases beyond five points.

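
The meaning of a probable error of about five points may be illustrated as follows. The sketch assumes, beyond what the source states, that the differences between the two IQ's are normally distributed about zero, so that the standard deviation equals PE/0.6745.

```python
from statistics import NormalDist

PE_DIFFERENCE = 4.97            # reported PE, Arthur vs. Kuhlmann-Binet IQ's
sigma = PE_DIFFERENCE / 0.6745  # assumption: normally distributed differences
differences = NormalDist(mu=0.0, sigma=sigma)


def share_within(points):
    # Proportion of cases whose IQ difference is `points` or less, either way.
    return differences.cdf(points) - differences.cdf(-points)


share_5 = share_within(5)
share_10 = share_within(10)
share_15 = share_within(15)

print(round(share_5, 2), round(share_10, 2), round(share_15, 2))
```

Under this assumption about half the cases differ by five points or less, roughly four-fifths by ten points or less, and nearly all by fifteen points or less.
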
Correlation coefficients between Stanford-Binet (1916) IQ’s and
Arthur scale IQ’s (Form I) are shown in Table 36.¹⁴ Most of these
coefficients are rather high and noteworthy. Excepting at age ten, they
show unusually high correspondence for these two types of scales be- 
tween ages five and twelve. The correlations at the later ages, however, 
are so low and the probable errors (PE) so large that the coefficients 


14 Calculated from data in Arthur, op. cit., Vol. 2, pp. 54-61.




may be regarded as being zero for all practical purposes. If the table of 
coefficients is representative of the correspondence existing between 
Stanford-Binet IQ’s and Arthur Performance IQ’s at ages above 
twelve, then we must conclude that the latter scale has been inade- 
quately standardized or is incapable of differentiating among individ- 
uals at the later age-levels in respect to the functions being measured. 
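
A rough check on this reading of Table 36 is to compare each coefficient with its probable error. The criterion used below, that a coefficient be at least four times its PE before it is taken as dependable, is a conventional rule of thumb adopted here as an assumption; it is not stated in the source. Age 14 is omitted because its coefficient is not legible in the table as reproduced.

```python
# (age, r, PE) as given in Table 36; age 14 is omitted (coefficient illegible).
table_36 = [
    (5, 0.70, 0.06),
    (6, 0.77, 0.04),
    (7, 0.68, 0.05),
    (8, 0.74, 0.05),
    (9, 0.80, 0.04),
    (10, 0.51, 0.08),
    (11, 0.68, 0.05),
    (12, 0.80, 0.04),
    (13, 0.21, 0.12),
    (15, -0.10, 0.17),
]

MIN_RATIO = 4.0  # assumed rule of thumb: |r| should be at least 4 x PE

# Ages at which the coefficient fails the assumed criterion.
weak_ages = [age for age, r, pe in table_36 if abs(r) / pe < MIN_RATIO]

print(weak_ages)
```

By this criterion only the coefficients at ages 13 and 15 fail, which agrees with the statement that the correlations at the later ages may be regarded as practically zero.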

Although the coefficients for ages five to twelve are quite marked, 
and in several instances high, they are, nevertheless, not close enough 
to unity (+1.00) to warrant the use of the Stanford-Binet and the 
Arthur scales interchangeably. For clinical purposes, the Arthur scale 
is valuable within the age range of five to twelve as a supplement to 
verbal scales of the Stanford-Binet type. 

This conclusion is further supported by validating data obtained 
subsequent to the publication of Arthur’s manual. These later studies, 
using both the 1916 and 1937 revisions of the Stanford-Binet, can 
be summarized as follows: 


Stanford-Binet and Arthur IQ’s correlate variously, from about .50 
to about .80. 


In a large majority of cases, the Arthur scale IQ’s tend to be some- 
what higher than the S-B at levels below 90 IQ. 


At the levels above 90 IQ, the S-B tends to yield somewhat higher 
ratings. 


The means of the differences between the IQ’s of the two scales
have been found to range from about 5 to 10 points.


While “discriminative value,” regarded by Arthur as a most important
criterion of validity, is no doubt significant, it does not in itself
demonstrate that a scale is measuring the functions it has set out to
measure.


There is, therefore, need for more definitive studies of the validity
of the Arthur scale, using large enough numbers of subjects, including
not only clinical and institutional groups, but a normal population
sampling as well.


Arthur’s own position appears to be that the appreciable extent 
of agreement between Stanford-Binet test results and those of her 
own performance scale indicates rather even development and manifestation
of psychological functions in general. But, she believes, if
the results of these two scales disagree significantly in the case of a
given individual, this is due to unevenness in development and ex- 
pression of functions; or some complicating nonintellectual factors 
are responsible for the discrepancy. 

In the case of any individual instance, the actual interpretation of 
performance test results, taken in conjunction with verbal test find- 
ings, will depend upon all information available with regard to the 
person concerned and upon the psychologist’s interrelating of all
relevant facts and data.


OTHER PERFORMANCE TESTS 


It is not our purpose to present a description of all available 
tests of the performance type. We have presented several in some de- 
tail in order to acquaint the reader with their nature and their uses. 
Those scales described are typical. There are, however, several others 
which are intended to serve a special purpose; of these, three will be
very briefly described.


The Ferguson Form Boards.¹⁵ The first description of these was published
in 1920. They consist of a series of six form boards used as
a unit and progressing in difficulty by fairly equal intervals. The tests 
were standardized upon 364 subjects ranging from children in grade 
one to college seniors. 

Ferguson, apparently, used grade placement and school achieve- 
ment as the principal evaluating criteria; for he reported correlations 
of his form board scores as follows: with grade placement, .81; with 
teachers’ estimates of intelligence, .50; with class standing, .56. 

Since their appearance in 1920, these form boards have been subjected
to experimentation from time to time for purposes of revising
procedure in administering and scoring, and providing more adequate
norms. One of the most thorough revisions is that by Wood and
Kumin (see footnote 15) who give norms for the ages of 7 years
and 6 months to 17 years and 5 months. But, as is the case with many
other tests of this type, a very large percentage of the standardization
population consisted of individuals who had come to a guidance


15 G. O. Ferguson, “A Series of Formboards,” Journal of Experimental Psychology,
Vol. 2, 1920, pp. 47-58; L. Wood and E. Kumin, “A New Standardization
of the Ferguson Formboards,” The Journal of Genetic Psychology, Vol.
54, 1939, pp. 265-284.




clinic for assistance and who, therefore, may not be representative 
of the general population of their age groups. It is necessary to con- 
sider this fact when using and interpreting results of performance 
tests so standardized. 

For present purposes, the fact of major interest is the extent to 
which these form boards and the scales of the Stanford-Binet type do 
or do not test common functions. Wood and Kumin report the fol- 
lowing correlation coefficients: Ferguson score and Stanford-Binet 
(1916) mental age, simple correlations, .54 for boys and .55 for girls; 
partial correlations, when chronological age is held constant, .34 for 
boys and .47 for girls. These coefficients indicate a relatively low 
community of function between the two tests. In this respect, the 
Ferguson form boards are quite similar to nearly all the other per- 
formance tests presented in this chapter. 
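
The effect of holding chronological age constant follows from the standard formula for a first-order partial correlation. In the sketch below, the simple coefficient of .54 is the one reported for boys, but the two correlations with age are hypothetical values chosen only for illustration, since the source does not report them.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    # Correlation of x and y with the third variable z held constant.
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


R_FERGUSON_BINET = 0.54   # reported simple correlation, boys
R_FERGUSON_AGE = 0.50     # hypothetical: Ferguson score with chronological age
R_BINET_AGE = 0.50        # hypothetical: Binet mental age with chronological age

r_age_constant = partial_r(R_FERGUSON_BINET, R_FERGUSON_AGE, R_BINET_AGE)

print(round(r_age_constant, 2))
```

With these assumed values the coefficient falls from .54 to about .39. The more closely each test is correlated with age, the greater is the reduction when age is held constant.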


Kent-Shakow Form Board Series.¹⁷ This series of four form boards,
devised by Kent and Shakow, was made available in 1925, as the 
Worcester Formboard Series. The series presents eight separate tasks 
graded in difficulty. In 1928 a modified series appeared, available in 
two forms: the Industrial Model and the Clinical Model which differ 
in respect to size, the former being the larger. The scale was first 
developed as a clinical instrument, with special reference to the needs 
of the Worcester (Massachusetts) State Hospital out-patient depart- 
ment. The standardization population consisted of 150 subjects from 
the age of six years upward, including adults. The early standardiza- 
tion, however, was very inadequate. Since these form boards were 
used frequently with adult subjects, better-standardized adult norms 
were published in 1939, based upon a population of average and 
superior adults numbering 355. The authors of this performance test 
do not attempt to validate it in terms of other established tests. Appar- 
ently, they are concerned mainly with providing a clinical device 


16 The results were of the same general order of magnitude when the Kuhlmann-Anderson
tests were correlated with Ferguson scores, namely, simple correlation
coefficients of .47 for boys and .45 for girls.

17 D. Shakow and G. H. Kent, “The Worcester Formboard Series,” Pedagogical
Seminary and Journal of Genetic Psychology, Vol. 32, 1925, pp. 599-
611; G. H. Kent and D. Shakow, “Graded Series of Formboards,” Personnel
Journal, Vol. 7, 1928, pp. 115-120; D. Shakow and B. Pazeian, “Adult Norms
for the K-S Clinical Formboards,” Journal of Applied Psychology, Vol. 23,
1939, pp. 495-502; W. R. Grove, “Modification of the Kent-Shakow Formboard
Series,” Journal of Psychology, Vol. 7, 1939, pp. 385-397.




which, they believe, measures manipulative skill and form analysis, 
and which, presumably, also provides a means of observing the 
subject’s modes of approaching a problem. 

FIG. 8.3. Modified Kent-Shakow Form Board Series: Subtests A, B, C, and D. (By permission of William R. Grove.)


In 1939 Grove published a modification of the Kent-Shakow series,
Industrial Model, based upon an experimental group of 300 “native
born white adult male prisoners incarcerated in Western Penitentiary
(Pennsylvania).” It is especially noteworthy that the scores obtained
on this modified series of form boards, when correlated with Stanford-Binet
(1916) ratings, yield a coefficient of only .43 ± .03 (PE). The
effect of age on this correlation is negligible, since it is based on a 
comparatively homogeneous group of about 300 adults. Thus, there 
is relatively little community of function measured by these two tests. 
Grove believes that his revised series measures ability to solve problems
presented in the form of concrete spatial relations. It is reasonable
to assume that the original Kent-Shakow tests measure the same
functions as does the modified series, even though the technical terms 
used to name these functions are different. 
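
One common way of expressing community of function, offered here as an interpretive aid rather than as the author's own computation, is the square of the correlation coefficient, which gives the proportion of variance the two measures share.

```python
def shared_variance(r):
    # Proportion of variance common to two measures that correlate r.
    return r ** 2


# Grove's coefficient between the modified series and the Stanford-Binet.
common = shared_variance(0.43)
not_common = 1 - common

print(round(common, 2), round(not_common, 2))
```

On this reading, a coefficient of .43 means that less than one-fifth of the variance is common to the two tests, which is consistent with the conclusion that they measure largely different functions.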


The Carl Hollow Square Scale. This is a form-board test designed 
for use primarily with adults, though it is usable also with children 
over ten years of age. The test consists of a “. . . wooden panel in 
which is cut a 4% inch square hole, and 29 blocks of varying straight 
line geometric forms, each having both straight and beveled edges. 
There are long rectangles, short rectangles, three classes of right tri- 
angles, diagonally truncated long rectangles, diagonally truncated 
short rectangles, and overlapping rectangle-triangles.” ¹⁸ The problem
for the testee is to fill the hole with sets of blocks, in a series of twenty 
tasks which become progressively more complex and difficult. Per- 
formance is scored on the basis of time required and number of moves 
made. Total scores may be converted into IQ’s and MA’s, or into 
percentile ranks. 

This form-board test, its constructor believes, measures the fol- 
lowing psychological processes: auditory memory (remembering prin- 
ciples and rules given in the instructions); visual memory (recall of 
partially repetitive patterns); observation and attention to detail; 
visual imagery involving synthesis and analysis (planning the place- 
ment of blocks without actual manipulation); learning (carry over 
from earlier tasks to subsequent ones). 

It appears that the principal validating criterion employed was 
correspondence with results obtained with other tests having rather 
wide acceptance (Stanford-Binet, Kohs Block Design, Otis group 
test, Terman group test, Thorndike-McCall Reading Test). Correla- 
tion coefficients between these criteria and the form-board results 
varied, in the case of adults, from .50 to .80; while in the case of chil- 
dren over 10 years, the coefficients were from about .60 to .80. With 
adults, the factor of chronological age would not tend to produce 
spuriously high coefficients; but when the coefficients found with 
children are being interpreted, due regard should be given to the age 
range from 10 to 16 years (the theoretical beginning of adult level). 


18 G. P. Carl, “A New Performance Test for Adults and Older Children:
The Carl Hollow Square Scale,” Journal of Psychology, Vol. 7, 1939, pp. 179- 
199. 




A coefficient of reliability of .87 is reported for this test, indicating 
an acceptable degree of consistency that compares favorably with 
other tests of the verbal as well as the nonverbal type. 

While the author grants that his performance test measures mental 
abilities which are involved more in the concrete and practical aspects 
of activity than in the abstract, he also maintains that it is more a 
measure of general than special ability. His view, presumably, is based 
on the correlations with the other tests of general ability already men- 
tioned. This question will be dealt with in the section evaluating per- 
formance scales as a group. 


FUNCTIONS TESTED BY PERFORMANCE SCALES 


This discussion will supplement the analysis that was pre- 
sented in connection with the nonverbal parts of the Stanford-Binet 
and the Wechsler scales. The reader should refer to those for more 
detail. 

Since all performance tests involve visual perception and manipula- 
tion of objects, the number of types of items is relatively limited. It 
is not surprising, therefore, to find that the range of psychological 
functions is also restricted. This is one reason why the correlations 
between performance scales and scales of the Stanford-Binet type are 
not higher than they are, since the latter can sample a much wider 
range of functioning. 

If the reader will re-examine the descriptions of the fifteen subtests 
in the Pintner-Paterson scale, and the few other types introduced in 
later scales, it will be readily apparent that they may all be classified
in one of a few categories:


geometric form boards, with variations, from the very simple to
rather complex

picture form boards (also known as picture completion) of
various degrees of complexity

block designs from simple to complex

mazes of varying degrees of complexity

recall of geometric designs

picture arrangement

block building

cube sequences (imitating the order of tapping a series of cubes)

digit-symbol




With the exception of the last two types, the performance items 
test visual perception plus more or less visual insight requiring analysis 
and synthesis. Performance on all the eight types also reflects motor 
speed in varying degrees; and performance on these is facilitated by 
visual imagery, that is, by the ability to analyze or synthesize a pattern 
imaginally before actually going through the movements. 

Block building and cube tapping sequences are largely matters of 
imitation requiring the functioning of visual imagery and recall. The 
digit-symbol test, as already indicated in connection with the Belle- 
vue, utilizes immediate rote recall and visual imagery. 

In all of the performance tests, visual-motor integration, affecting 
the speed with which a person responds, is involved. 

Also, many clinical psychologists hold that performance tests pro- 
vide an estimate of the subject’s attention span, especially in the case 
of mentally retarded and deficient individuals. Attention span, how- 
ever, is not a process in the sense that visual analysis, memory span, 
etc., are. Attention is, rather, an attribute of the situation in which an 
individual is placed. If the testee is interested in the task at hand, and 
if the test is within the range of his apprehension, he will be attentive.
If he does not understand the task, and if he is unable to make any 
progress with it and is confronted by repeated failure, he will very 
probably be inattentive. 

It will be noted that these test items make few or no demands upon 
abstraction,¹⁹ concept formation, or the necessity of transcending the
immediate concrete situation. For this reason, performance tests are 
regarded as having limited value as measures of general capacity, 
especially in the testing of individuals who are above average level. 


EVALUATION OF PERFORMANCE TESTS 


These tests were first constructed as substitutes for verbal 
scales of the Binet type. Many correlational studies showed, however, 
that it is sounder practice to regard the former as supplements to the 
latter. The reason for this interpretation is that, when allowance is 
made for the factor of chronological age, psychologists have found in 
almost every instance that the coefficients of correlation fall at .50 
or lower—generally lower. Hence, although the two types of tests 
measure some functions in common, or are in other ways interrelated, 


19 “Abstraction” means the separation of a quality, or an idea, or a principle
from the organization and details of a concrete situation.




each type also measures functions different from those of the other 
type. 

Performance tests have been found most useful with persons handi- 
capped by language disabilities: the deaf, the non-English speaking 
groups, the illiterate, and those who have speech or reading diffi- 
culties. 

These tests are valuable also in helping to identify children who are 
shy or inarticulate because of emotional reasons and who, therefore, 
may appear at a disadvantage on verbal tests of mental ability. 

Performance tests, used together with the verbal type, are helpful 
in identifying the mentally deficient and mentally retarded with in- 
creased certainty. In cases involving diagnosis of mental deficiency, 
it is often desirable to supplement the Stanford-Binet, or a similar 
scale, with performance tests in order to check on the probable roles 
of the language factor, lack of cultural opportunities, and poor edu- 
cational experience in order to estimate to what extent these might 
have adversely affected the testee’s score on the verbal type of scale. 
If a significant difference is found between the two obtained ratings, 
further study of the individual is indicated before a conclusion is 
reached. It has often been found that mentally deficient and mentally 
retarded persons obtain somewhat higher ratings on performance than 
on the Stanford-Binet and similar scales. The differences, however, 
are not always significant enough to raise a question regarding the 
diagnosis. Furthermore, it is to be expected that, in many cases tested, 
the differences would be in the direction stated, since the performance 
tests were given for the very reason that the examining psychologist 
judged that the testee might be more successful with performance test 
problems. 

Arthur, on the other hand, reported that for 435 clinic cases, 
having Stanford-Binet IQ’s of less than 95, there was no group trend
in the direction of higher IQ ratings on the Arthur Performance Scale,
even among the duller individuals in the group.²⁰ She also found that
for 60 mentally deficient cases (ages 15 to 20) who were routinely 
examined, the differences between verbal and performance IQ’s were, 
on the whole, negligible. These are significant findings with regard to 


20 Grace Arthur, “An Attempt to Sort Children with Specific Reading Disability
from Other Non-Readers,” Journal of Applied Psychology, Vol. 11,
1927, pp. 251-264; “The Relative Difficulty of Various Tests for Sixty Feeble- 
minded Individuals,” Journal of Clinical Psychology, Vol. 6, 1950, pp. 276-279. 




group trends and group generalizations. But they do not alter the fact 
that in some individual cases, however few, a performance scale might 
yield sufficiently discrepant results to preclude a diagnosis without 
further study of the case. A discrepancy between an individual’s rating 
on a performance scale and that on a verbal scale should be useful, 
rather than otherwise, in a clinical situation because then the psy- 
chologist must find the reasons for and seek an interpretation of the 
discrepancy. This necessity prevents the formulation of unwarranted 
conclusions, and it results in a fuller understanding of the person being 
examined. 

Among the advantages reported for performance tests are these: 
(1) since they do not require the use of language, individuals do not 
“block” as a result of feelings of inadequacy due to lack of formal 
schooling; (2) since all elements of the problem are visually present, 
some individuals proceed with greater confidence. 

Clinical psychologists are agreed that, where indicated, the use of 
performance scales can provide more information than just a rating 
in the form of a numerical index. These tests provide an opportunity
to observe qualitative aspects of behavior under standardized condi- 
tions in a variety of problem-situations. A subject’s approach to a 
problem might reveal, for example, a state of depression or agitation; 
hesitation or impetuousness; thoughtful deliberateness, bull-headed 
persistence, or easy discouragement; an insightful approach or one 
of haphazard trial-and-error. 

Performance tests also have their disadvantages and limitations. 
As already stated, they are limited in range of mental functioning 
tested; hence they do not differentiate well among individuals of better 
than average levels. In fact, most performance scales thus far de- 
veloped do not differentiate adequately among a large portion of the 
members of a representative population above twelve or thirteen years 
of age. On the whole, it is at the lower age levels and the lower mental 
levels that performance tests are most useful, in addition to their use- 
fulness with persons having the handicaps already mentioned. 

Geometric form boards, picture completion boards, object assem- 
blies, etc., are within almost universal experience of American school 
children; and since they make demands upon the significant mental 
processes already indicated, they may be regarded as, in some degree, 
measures of intelligence at these earlier age levels. However, as performance
tests go higher in age levels and difficulty levels, they present
problem-situations which are highly specialized and for the handling
of which the subjects have by no means had a common background. 
This is especially true of the more complex and subtle form boards 
(e.g., Carl Hollow Square), performance on which is facilitated by 
training and experience in tasks requiring spatial perception, as in 
some types of engineering, cabinet making, and the like. Since these 
tests do not require, to a significant degree, the use of ability to make 
abstractions and to deal with concepts, they fail to measure some of 
the most important aspects of mental activity. 

The reader will have noted that not all authors of performance tests 
agree as to what their characteristics should be. The Arthur scale was 
constructed to provide a nonverbal substitute for the Binet revisions. 
Hence it was expected to have a very significant correlation with these 
revisions, and standardization proceeded on that principle. The 
Cornell-Coxe scale is intended not as a substitute for Binet revisions 
but as a supplement to them. Thus, this performance scale was con- 
structed on the principle that there should be a relatively low correla- 
tion between it and the verbal type of intelligence tests. Experimental 
evidence suggests that, at least in their present stage of development, 
performance scales are most properly and advantageously used in the 
manner advocated by Cornell and Coxe. For performance scales are 
instruments with which we may test development of insightful be- 
havior involving visual perception rather than through the use of 
symbols (language and number) which are essential for abstractions, 
concept formation, ideational reasoning, and ability to deal with 
problems extending beyond one’s immediate, concrete environment. 


9. 




SCALES FOR INFANTS AND 
PRESCHOOL CHILDREN 


IN THIS chapter we shall present several representative scales de- 
vised to evaluate mental development of individuals ranging in age 
from one month to six years. Some of these scales, for the greatest 
part, are not tests as that term is commonly understood. They are, 
rather, norms and inventories of development and behavior, grouped 
at their respective average age levels, derived from observation of 
children’s behavior and from experimentation in a variety of situa- 
tions. All are administered individually, of course. 


GESELL DEVELOPMENTAL SCHEDULES 


These are a product of systematic study of infants and young 
children at the Yale Clinic of Child Development. The first schedule,¹
an early effort published in 1925, provided rather crude norms at the 
following age levels: 4, 6, 9, 12, 18, 24, 36, 48, and 60 months. 

At each level the inventory of activities was divided into four 
categories of behavior: (1) motor, (2) adaptive, (3) language, (4) 
personal-social. Although the normative schedules themselves have 
undergone considerable revision and refinement since their first ap- 
pearance, these four categories, with some minor variations in termi- 
nology and analysis at times, have remained throughout. Motor be- 
havior is said to be of value “. . . because it has so many neuro- 
logical implications, and because motor capacities of the child con- 


1 A. Gesell, The Mental Growth of the Preschool Child, New York:
Macmillan, 1925.




stitute the natural starting point for an estimate of his maturity.” In 
adaptive behavior “. . . we reckon with the finer sensori-motor ad- 
justments to objects and situations: the coordination of eyes and 
hands in reaching and manipulation; . . . the capacity to initiate 
new adjustments in the presence of simple problem situations which 
we set before the infant.” Language behavior, broadly used, includes 
“. . . all visible and audible forms of communication, whether by
facial expression, gesture, postural movements, vocalizations, words, 
phrases, or sentences. [It] includes mimicry and comprehension of 
the communications of others.” Personal-social behavior “. . . com- 
prises the child’s personal reactions to the social culture in which he 
lives [bladder and bowel control, feeding abilities, sense of property, 
self-dependence in play, cooperativeness, responsiveness to training 
and social conventions].” ²


The Infant Schedule. The schedules of 1925 were followed by re- 
ports of further investigations upon which revised normative inven- 
tories of behavior were based. One schedule has been devised for the 
examination of infants between ages of four weeks and fifty-six weeks. 
At the four week level, the inventory of behavior includes analysis of 
head control, arm-hand posture, leg-foot posture, body posture and 
progression, regard, prehension, language and social behavior. At the 
fifty-six week level, the inventory includes the following categories: 
body posture and progression, prehension, manipulation and adaptation,
language and social behavior.³ Each of these categories of behavior is
evaluated, in the case of a particular infant, by observing him in a
number of situations.⁴ Each situation is broken down into a number of
possible activities detailing the manner in which the infant might
respond. (See Table 37.⁵) Since the enumerated responses follow diverse
trends with age, they have been designated as follows:
(1) decreasing trend, if at ascending ages there is a progressive de- 


2 A. Gesell and C. S. Amatruda, Developmental Diagnosis, revised edition, 
New York: P. B. Hoeber, 1947, pp. 5-6. 

3 A. Gesell and H. Thompson, The Psychology of Early Growth, New York: 
Macmillan, 1938, pp. 147 ff. 

4 For example, activity with a ball, a bell, rattle, cubes, cup, spoon,
formboard, mirror, boxes, pellet and bottle, dangling ring, paper and
crayon; patterns of body posture; locomotion; spontaneous activities in
various daily situations, such as in toilet, bath, and crib; kinds of
play; responses to people; kinds of vocalization.

5 From Gesell and Thompson, The Psychology of Early Growth, New York:
Macmillan, 1938, p. 127. (By permission.)




TABLE 37


Dangling Ring Behavior (4 weeks—28 weeks)
Situation: Dangling Ring (RD)
(Percentage of infants showing each behavior, by age in weeks;
? = entry illegible)


RD   | Behavior Items                       |  4 |  6 |  8 | 12 | 16 | 20 | 24 | 28

(1)  | Regards after delay                  | 77 | 54 | 64 | 65 | 27 | 13 | 14 |  5
(2)  | Regards immediately                  | 26 | 46 | 36 | 35 | 68 | 97 | 96 | 95
(3)  | Regards momentarily                  | 53 |  ? |  ? |  ? |  ? |  ? |  ? |  ?
(4)  | Regards prolongedly                  | 47 | 43 | 29 | 62 | 87 | 47 | 38 |  5
(5)  | Regards consistently                 |  ? |  ? |  ? |  ? |  ? |  ? | 59 | 90
(6)  | Disregards in midplane               |  ? |  ? |  ? |  ? |  ? |  ? |  ? |  ?
(7)  | Regards in midplane                  | 29 | 61 | 54 | 54 | 86 |    |    |
(8)  | Regards in midplane (long head)      | 22 | 25 | 12 | 50 | 83 |    |    |
(9)  | Regards in midplane (round head)     | 32 | 75 | 70 | 56 | 88 |    |    |
(10) | Regards ring in hand                 |  ? |  ? |  ? |  ? | 66 | 82 |100 |100
(11) | Regards string                       |  ? |  ? |  ? |  ? |  ? |  ? |  ? |  ?
(12) | Shifts regard                        | 94 |100 |100 | 96 | 93 | 46 | 38 | 41
(13) | Shifts regard to surroundings        | 75 | 68 | 61 | 35 | 13 | 16 | 14 |  5
(14) | Shifts regard to Examiner’s hand     | 28 | 64 | 61 | 77 | 48 | .. | .. | ..
(15) | Shifts regard to Examiner            | 41 | 54 | 57 | 65 | 64 | 27 | 24 | 27
(16) | Shifts regard to hand                |  ? |  ? |  ? |  ? |  ? |  ? |  ? |  ?
(17) | Follows past midplane                | 44 | 62 | 50 | 58 | 84 |    |    |
(18) | Follows past midplane (lg. h.)       | 20 | 33 | 25 | 37 | 83 |    |    |
(19) | Follows past midplane (rd. h.)       | 55 | 75 | 60 | 67 | 77 |    |    |
(20) | Follows approximately 180°           | 16 | 43 | 46 | 50 | 68 |    |    |
(21) | Follows approximately 180° (lg. h.)  |  0 | 11 | 25 | 25 | 83 |    |    |
(22) | Follows approximately 180° (rd. h.)  | 36 | 55 | 55 | 61 | 62 | .. | .. | ..
(23) | Approaches                           |  0 |  0 | 11 | 12 | 62 | 89 | 96 |100
(24) | Approaches after delay               |  ? |  ? |  ? |  ? | 20 |  ? |  ? |  9
(25) | Approaches promptly                  |  ? |  ? |  ? |  ? | 32 | 66 | 81 | 91
(26) | Arms increase activity               |  0 |  4 | 11 | 42 | 64 |    |    |
(27) | Arms separate                        |  0 |  0 |  4 | 15 | 17 | 19 |  7 |  ?
(28) | Approaches with one hand             |  0 |  0 |  4 | 12 | 20 | 24 | 39 | 55
(29) | Approaches with both hands           |  0 |  0 |  0 |  0 | 50 | 76 | 82 | 77
(30) | Approaches with arms flexed          |  0 |  0 |  0 | 12 | 44 | 60 | 54 | 14
(31) | Hands come together                  |  0 |  0 |  0 |  8 | 20 | 38 | 11 |  5
(32) | Contacts ring                        |  3 |  4 |  4 | 15 | 43 | 81 |100 |100
(33) | Dislodges ring on contact            |  3 |  4 |  4 |  8 | 20 | 35 | 28 |  5
(34) | Grasps                               |  0 |  0 |  0 |  8 | 22 | 73 | 96 |100
(35) | Grasps after delay if grasps         |  ? |  ? |  ? |  ? |  ? | 75 | 46 | 14
(36) | Grasps interdigitally                |  ? |  ? |  ? |  ? |  ? |  ? |  ? |  ?
(37) | Retains entire period                |  ? |  ? |  ? |  ? | 20 | 19 | 40 | 65
(38) | Holds with both hands                |  ? |  ? |  ? |  ? | 10 | 33 | 56 | 67
(39) | Hand opens and closes on ring        | .. | .. |  ? |  ? | 30 | 11 | 10 | 14
(40) | Brings ring to mouth                 |  ? |  ? |  ? |  ? | 38 | 58 |  ? | 74
(41) | Free hand to midplane                |    |    |    |    | 25 | 51 | 56 | 84
(42) | [illegible]                          |    |    |    |    |  3 | 18 | 41 | 74
(43) | Drops                                |    |    |    |    | 78 | 56 | 41 | 32
(44) | Drops immediately                    |  ? |  ? |  ? |  ? | 42 | 32 |  7 |  0
(45) | Regards dropped ring if drops        | .. | .. | .. | .. | 10 | 37 | 43 |100
(46) | (If drops) pursues dropped ring      |  ? |  ? |  ? |  ? |  7 | 16 | 29 |100
(47) | (If drops) resecures dropped ring    | .. | .. | .. | .. |  7 |  5 | 29 | 60
(48) | Rolls to side                        |  3 |  4 |  8 |  4 | 35 | 42 | 38 | 18
(49) | [illegible]                          |  9 | 14 |  4 |  8 | 27 | 23 | 32 | 21


crease in percentage of infants showing that behavior; (2) increasing 
trend, if at ascending ages there is a progressive increase in percentage 
showing that behavior; (3) focal trend, if at consecutive ages there is 
an increase, followed by a decrease in percentage giving that response. 
The “increasing” and “decreasing” behavior items were allocated to 
age levels on the basis of fifty percent frequency. The “focal” behavior 
items were placed at age levels at which they are most frequently
observed.⁶
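The three trends and the rule of allocation can be illustrated with a short sketch in Python. It is only an illustration under stated assumptions, not the procedure of Gesell and Thompson: the function names are hypothetical, a trend is judged strictly (any reversal disqualifies it), and for a "decreasing" item the age level is assumed to be the last age at which at least fifty percent of infants still show the behavior. The percentages are three rows of Table 37 as they read in this copy.

```python
AGES_WEEKS = [4, 6, 8, 12, 16, 20, 24, 28]


def classify_trend(percents):
    """Label a series of percentages as increasing, decreasing, focal, or fluctuating."""
    pairs = list(zip(percents, percents[1:]))
    if not any(b < a for a, b in pairs):
        return "increasing"      # never falls from one age to the next
    if not any(b > a for a, b in pairs):
        return "decreasing"      # never rises from one age to the next
    peak = percents.index(max(percents))
    rises_to_peak = all(b >= a for a, b in zip(percents[:peak + 1], percents[1:peak + 1]))
    falls_from_peak = all(b <= a for a, b in zip(percents[peak:], percents[peak + 1:]))
    # A single rise followed by a single fall is a focal trend;
    # anything else has more than one focus.
    return "focal" if rises_to_peak and falls_from_peak else "fluctuating"


def allocate_age(ages, percents):
    """Return the age level to which an item would be assigned, or None."""
    trend = classify_trend(percents)
    at_least_half = [age for age, p in zip(ages, percents) if p >= 50]
    if trend == "increasing":
        return at_least_half[0] if at_least_half else None
    if trend == "decreasing":
        # Assumption: last age at which half the infants still respond.
        return at_least_half[-1] if at_least_half else None
    if trend == "focal":
        return ages[percents.index(max(percents))]
    return None                  # fluctuating items are disregarded


approaches = [0, 0, 11, 12, 62, 89, 96, 100]          # item 23
arms_flexed = [0, 0, 0, 12, 44, 60, 54, 14]           # item 30
drops = [78, 56, 41, 32]                              # item 43, ages 16 to 28

print(classify_trend(approaches), allocate_age(AGES_WEEKS, approaches))
print(classify_trend(arms_flexed), allocate_age(AGES_WEEKS, arms_flexed))
print(classify_trend(drops), allocate_age(AGES_WEEKS[4:], drops))
```

With real normative data, small reversals due to sampling would have to be tolerated before a trend is called fluctuating; the strict comparison here is chosen only to keep the sketch short.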

Scoring. The infant’s responses are scored plus or minus, depending 
upon whether or not he manifests the enumerated behaviors. The 
score on each item of behavior is then noted on a record sheet in 
accordance with the categories listed above. The infant’s “distinctive” 
(modal) level of behavior is found by observation; from this level, 
responses showing greater or lesser degrees of maturity are counted; 
an algebraic sum of the deviating responses is found; this sum is then 
related to the “distinctive” level in order to determine whether the 
trend is in a plus or minus direction. Finally, a rating is assigned the 
infant in each of the categories, thus providing a profile of develop- 
ment. The following scores are illustrative of the ratings that might 
be found in a particular case.⁷


postural behavior IG- weeks 
prehensory behavior 38+ weeks 
perceptual behavior 28 weeks 
adaptive behavior 28 weeks 
language behavior 33 weeks 
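
The arithmetic of the plus and minus trend may be sketched as follows. This is a loose illustration, not the authors' detailed scoring method: the function name and the data layout are hypothetical, the "distinctive" level is supplied by the examiner (as the text says, it is found by observation), and it is assumed that a success above that level counts +1 and a failure below it counts -1.

```python
def maturity_rating(distinctive_level, responses):
    """Attach a plus or minus trend to an examiner-supplied distinctive level.

    responses: list of (age_level_in_weeks, passed) pairs for one category.
    """
    total = 0
    for level, passed in responses:
        if level > distinctive_level and passed:
            total += 1   # behavior more mature than the distinctive level
        elif level < distinctive_level and not passed:
            total -= 1   # failure on behavior less mature than that level
    suffix = "+" if total > 0 else "-" if total < 0 else ""
    return f"{distinctive_level}{suffix} weeks"


# Hypothetical record for one category of behavior.
record = [(24, True), (24, False), (28, True), (32, True), (32, False), (36, True)]
print(maturity_rating(28, record))
```

In this invented record two successes lie above the 28-week level and one failure below it, so the algebraic sum is positive and the rating carries a plus sign.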


6 A fourth type of item was also found: those having a fluctuating trend;
that is, having more than one focus. But they are disregarded in the scoring.

7 For a detailed account of scoring method, see Gesell and Thompson,
op. cit., pp. 209 ff.




Scoring of the items in the several categories as plus or minus de- 
mands considerable clinical experience in observing and evaluating 
infant behavior. Also, once having scored the items, an appreciable 
element of subjectivity enters into the reading and interpretation of 
the scoring record sheet. 


Validity and Reliability. This developmental schedule was not sub- 
jected to the usual tests of validity because its builders had no criteria 
of maturity levels to use as a basis of comparison. They do, however, 
present a rationale of their schedule as the basis for its validity. 
“Fundamentally the validity of the schedule here offered depends on 
the validity of the norms, the legitimacy of the category classifications, 
the appropriateness of each item for the category to which it is allo- 
cated, the soundness of the concept of maturity level, and the justness 
of using a sample of the child’s behavior to indicate that level. . . 
Our conclusions regarding them [the foregoing issues] go beyond 
experimental data and are based on years of clinical experience. We
are justified in claiming their general soundness and practical
applicability until contrary evidence is revealed.” ⁸

It is maintained, likewise, by the authors of the schedule that
reliability of their schedule cannot be determined by the usual
statistical methods presented in connection with other types of tests.
Gesell and Thompson state, “It is the systematic error [of testing]
which we have tried to reduce by basing the schedule on a carefully
planned and controlled study of infant behavior . . . the accuracy of
the schedule resolves in last analysis to the question of the accuracy
of each item of the norms.” ⁹ And they believe the percentages at
different age levels passing and failing various behavior items, as
derived from their normative studies, are reliable to a satisfactory
degree.


The Preschool Schedule. Gesell and his colleagues have also developed
a scale extending from the age of fifteen months to six years.¹⁰
Norms are provided for the following age levels: 15, 18, 21, 24, 30,
36, 42, 48, 54, 60, and 72 months. For illustrative purposes, the
schedules at the two extremes are given below.


8 Gesell and Thompson, op. cit., p. 218.

9 Ibid., p. 219. This statement by Gesell and Thompson, as evidence of
reliability, leaves much to be desired.

10 A. Gesell, et al., The First Five Years of Life, New York: Harper,
1940. (Reprinted by permission.)




15-Month Level 
Motor: 


Walks: few steps, starts and stops 
Walks: falls by collapse 

Walks: has discarded creeping 
Stairs: creeps up full flight 
Cubes: tower of 2 

Pellet: placed in bottle 

Book: helps turn pages 


Adaptive: 
Cubes: tower of 2 
Cup and Cubes: 6 in and out cup 
Drawing: incipient imitation stroke 
Formboard: places round block 
Formboard: adapts round block promptly 


Language: 
Vocabulary: 4-6 words or names 
Jargon: uses 
Book: pats picture 
Picture card: points to dog or own shoe 


Personal-Social: 
Feeding: has discarded bottle 
Feeding: inhibits grasp of dish on tray 
Toilet: partial toilet regulation 
Toilet: bowel control 
Toilet: indicates wet pants 
Communication: says “ta-ta” or equivalent 
Communication: indicates wants (points or vocalizes) 
Play: shows or offers toy to mother or examiner 
Play: casts objects playfully or in refusal 


72-Month Level 


Motor: 
Jumps from height of 12”, landing on toes only 
Advanced throwing 
Stands on each foot alternately, eyes closed 
Walks length of 4 cm. board 
Copies diamond 


Adaptive: 


Builds 3 steps with cubes 
Draws man with neck, hands on arms, and clothes 




Draws man with 2-dimensional legs 

Copies diamond 

Adds 9 parts to incomplete man 

Discriminates 5 weights, no error 

Detects missing parts of pictures 

Repeats four digits 

Gives correct number of fingers on single hand 
and on both 

Adds and subtracts within five 


Language: 
Binet items used here 


Personal-Social: 


Ties shoe-laces 

Differentiates A.M. and P.M. 

Knows right and left or complete reversal 
Recites numbers up to the thirties 


Scoring. These schedules are not scored quantitatively. They are a
general clinical guide intended for use in estimating the develop- 
mental status of a given child in respect to the four designated cate- 
gories of behavior. 


Validity and Reliability. In respect to validity and reliability, Gesell
and his collaborators take the same position, presumably, as that
quoted in connection with their other schedule; for no statistical
evidence is provided beyond percentages passing at the several age
levels. “Application of the schedules is a simple matter of determining
how well a child’s behavior fits one age level constellation rather than
another, by the method of direct comparison. . . . There is nothing
mathematical in this determination, neither is there anything mystical
about it. It amounts to matching, which is neither calculation nor
intuition; . . . Performance and developmental status are reported
separately for each of the four categories of behavior in terms of the
four approximate age levels.”


Evaluation. Inspection of these two developmental schedules reveals
that each is a combination of some aspects of mental development
(as usually understood), motor development, sensory development
and perception, and development of personal habits (often called
social development).

The schedule for infants (4 weeks to 56 weeks of age) has value 
for the experienced psychologist because it provides means, experi- 
mentally and clinically derived, of estimating, nonquantitatively, spe- 
cified aspects of a child’s development within the first year of life. But 
as a psychological test, this schedule does not satisfy the demands of
standardization in terms of norms, reliability, and validity. The popu- 
lation sample was small (49 boys and 58 girls) and restricted (from a 
homogeneous middle-class background). It yet remains for some psy- 
chologists to subject the schedule to rigorous studies of reliability and 
validity before it can be used with considerable confidence for pre- 
dictive purposes. On the positive side, it can be said that superior ex- 
perimental techniques were used, and a great deal of careful observa- 
tion, experience, and behavioral insight went into the derivation of the 
developmental schedule. For these reasons, when applied by skilled 
observers, it is useful in appraising an infant’s developmental status 
as it appears at the time of the examination. 

For the reasons stated above, the second schedule of development 
and behavior (for ages 15 to 72 months) is of questionable value. 
For this group of children, especially those two years of age and older, 
there are other scales that have been standardized and that can be 
used with more confidence. With some of these, the reader is already 
familiar (e.g., the Stanford-Binet); others will be described in the
following pages.


MINNESOTA PRESCHOOL SCALE ¹²

This scale, in two forms, is an adaptation and restandardiza- 
tion of test items chosen from the earlier work of a number of psy- 
chologists, plus some original additions. It is designed for use with 
children from age eighteen months to six years. 

The scale includes the following twenty-six tests: pointing out parts 
of the body; pointing out parts in pictures; naming familiar objects;
copying a circle, triangle, and diamond; imitative drawing (vertical 


12 F. L. Goodenough, J. C. Foster, and M. J. Van Wagenen.
Educational Test Bureau, 1932 and 1940. The revised test manual,
1940, is by F. L. Goodenough, K. M. Maurer, and M. J. Van Wagenen.




and horizontal strokes and a vertical cross); block building; response 
to pictures; Knox cube imitation (tapping a series of cubes in a given 
order); obeying simple commands; comprehension (“What should
you do when you are hungry?”); discrimination of geometric forms;
naming objects from memory; recognition of forms; color naming;
tracing a form; picture puzzles (object assembly); incomplete pic- 
tures; digit span; picture puzzles, diagonal series (more difficult object 
assembly); paper folding; absurdities (verbal); mutilated pictures; 
vocabulary; word opposites; imitating position of clock hands; speech 
(length of sentence spoken by child during examination). 


Scoring. The raw score is converted into what is known as a C-score, 
which in turn can be converted into an IQ equivalent by means of 
tables provided in the manual.¹³ The Minnesota scale also provides
for another type of score known as “percent placement,” which is 
defined as the percentage of the difference between the score of the 
most backward and the score of the most advanced child likely to be 
found in a representative group of a thousand children of similar age. 
Thus, if the lowest C-score in a given group is 50 and the highest is 
110, the range is 60 points. A child who gets a C-score of 65 is 15 
units, or 25 percent (15/60), above the lowest score in the group. 
His “percent placement” score, then, is 25. 
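
The computation just described may be written out as a small function. This is a minimal sketch that follows the worked example in the text; the function name and argument names are illustrative and do not come from the manual.

```python
def percent_placement(c_score, lowest, highest):
    """Position of a C-score within the group range, as a percentage of that range."""
    if highest <= lowest:
        raise ValueError("highest score must exceed lowest score")
    return 100 * (c_score - lowest) / (highest - lowest)


# The example from the text: range 50 to 110, child's C-score 65.
print(percent_placement(65, 50, 110))
```

A child at the lowest score of the group thus has a percent placement of 0, and a child at the highest score has one of 100.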

The norms of this scale are so arranged that it is possible to obtain
three separate scores for children above thirty months of age: a verbal,
a nonverbal, and a total score. For a child under thirty months of age
only the total score is used because the authors of the scale were
unable to work out a system of differentiated scoring for these earlier
levels. A rough analysis is possible, however, to determine whether
a pronounced difference between verbal and nonverbal responses
exists. If such a difference is found at any age within the range of the
scale, then, as the case may be, handicap or acceleration in respect to
language or perceptual-motor ability may be inferred.



13 The C-score “. . . represents the difficulty of the tasks with which [a child]
may be expected to succeed in 50 percent of his trials” (Manual, p. 91). It is a
form of “absolute scaling,” the units of which, presumably, increase by steps
that are approximately equal in difficulty. It is a variation on the familiar
standard-score technique. For a more detailed description of the C-scores and
the basis of determining IQ equivalents, see F. L. Goodenough and K. M.
Maurer, The Mental Growth of Children From Two to Fourteen Years,
Minneapolis: University of Minnesota Press, 1942, Chapter IV.




Validity and Reliability. The Manual of the Minnesota scale does 
not provide data specifically designated as evidence of validity. We 
may infer, however, that the authors regarded the following facts as 
their basis of validity: (1) the adaptation and use of types of test 
items considered by many psychologists, over a period of years, to 
have validity; (2) a standardization group of 900 children, ranging 
in age from eighteen months to six years (100 in each of nine half- 
year age groups), who were balanced equally as to sex and whose 


TABLE 38


Correlations Between Minnesota Preschool and
Stanford-Binet IQ’s


Age in Months at              Correlations
Taking Minnesota *      1916 S.-B.    1937 S.-B.
Under 36                   .45           .21
36-47                      .64           .61
48 and over                .65           .68


* The number of cases in each group was large, ranging from
141 to 841. From Goodenough and Maurer, op. cit., Part II.


fathers were representative of the distribution of occupational levels 
in the general population. 

In another and more recent publication,¹⁴ data relevant to the scale’s
validity are available. Children who had been tested originally with
the Minnesota scale at various ages during their preschool years were
retested with the 1916 Stanford-Binet, in some instances, or with the
1937 revision in others. The intervals between tests and retests varied
from a few months to about ten years.

When the 1916 revision was used in retesting and the results were
correlated with the total scores of the Minnesota, the range of
coefficients for the various groups was from a low of .25 to a high of .75.

When the 1937 Stanford-Binet was used in retesting, the correlations
with original Minnesota scale total scores yielded coefficients
from .15 to .76.

Table 38 shows the median correlations between original Minnesota
IQ equivalents, found at various ages, and the retest Stanford-Binet
IQ’s at ages ranging from 4½ to 13½ years.


14 Goodenough and Maurer, op. cit., Part II.




If we accept the Stanford-Binet scales as significant criteria of va- 
lidity, as many psychologists have in actual practice, then we must 
conclude that the Minnesota scale has low validity below the age of 
thirty-six months, but that it has much greater validity for individuals 
above three years of age. 

In this connection two considerations must be kept in mind. First, 
it has been found that all scales devised for use with children below 
the age of eighteen months show a low or very moderate correlation 
with retest results in later childhood. The probable reasons for this 
fact will be presented later. Second, the correlation coefficients be- 
tween results of testing and retesting the same subjects with the same 
scale tend to decrease somewhat as the time interval between
examinations increases.

Reliability data of the Minnesota scale are variable at the different 
age levels. The coefficients between the C-scores on the two forms of 
the scale (that is, the test-retest method), with intervals of one to 
seven days, were: .68 to .94 for the verbal tests, .67 to .92 for the 
nonverbal tests, and .80 to .94 for the combined total scores. The
average reliability coefficients for a single form, within an age-range
of six months, were: .86 for the verbal, .82 for the nonverbal, and
.89 for the total scores.


CATTELL DEVELOPMENTAL AND INTELLIGENCE SCALE ¹⁵


This scale, of superior merit, covers the range from two to 
thirty months. Its test items are adaptations of many which were de- 
veloped and included in earlier tests, notably those of Gesell and his 
associates. Cattell states the scale “. . . has been so constructed as 
to constitute an extension downward of Form L of the Stanford-Binet 
tests. Between the ages of twenty-two and thirty months Stanford- 
Binet items are intermingled with other items. Thus, using the infant 
test items for the early months and the Stanford-Binet tests for the 
older ages with a mixture of the two between, one continuous scale 
from early infancy to maturity has been attained.” ¹⁶ The test items
are grouped at age levels as they are in the Stanford-Binet. Groupings 
are provided at each month from two through twelve; at two-month 
intervals in the second year; and at twenty-seven and thirty months. 


15 Psyche Cattell, The Measurement of Intelligence of Infants and Young
Children, New York: Psychological Corporation, 1940.

16 Ibid., p. 24.




The following three age levels illustrate the nature of the items and 
their arrangement. 


Two months 
(1) Attends voice 
(2) Inspects environment 
(3) Follows ring in horizontal motion 
(4) Follows moving person 
(5) Babbles 
(Alt. a) Follows ring in vertical motion 
(Alt. b) Lifts head in prone position 


Ten months 


(1) Uncovers toy 

(2) Combines cup and cube 

(3) Attempts to take third cube 

(4) Hits cup with spoon 

(5) Pokes fingers in holes of peg board 
(Alt.) Picks up spoon before cup 


Thirty months 


(1) Differentiates bridge from tower 
(2) Imitates drawing lines and circles 
(3) Stanford-Binet three-hole form board rotated 


(4) Folds paper
(5) Stanford-Binet identifying objects by use 


(Alt. a) Identifies pictures from name 

(Alt. b) Concept of one 

The scale was standardized by longitudinal testing; 1346 examinations
were made on 274 children at the ages of three, six, nine, twelve,
eighteen, twenty-four, thirty, and thirty-six months.¹⁷

In the process of standardization, it was Cattell’s purpose, among
other things, to improve on earlier scales by: (1) improving objective
procedures for administering and scoring; (2) eliminating items of
the “personal-social” category, which are markedly influenced by
home training; (3) eliminating items which are indicators of large
motor control; (4) providing more accurate age-scaling; (5) providing
an adequate age-range so that continuity of development can be
studied; (6) providing a more nearly equal distribution of items over
the age range covered.


17 It should be noted that while the scale was standardized only on these age
levels, groups of items are provided at certain age levels between them. The
placement of items between the standardization levels was estimated. Cattell
states, however, that the indications are the scale may be used with only a little
less accuracy with children between the standardization ages. At the same time,
she urges the exercise of caution in interpreting test results at the ages between
the standardization levels.


FIG. 9.1. Complete set of material for administering the infant
tests. From P. Cattell, The Measurement of Intelligence of Infants
and Young Children, Psychological Corporation. (By permission.)



FIG. 9.2. Regards Cube. Age 3 Months.

Material: A one-inch cube painted bright red.

Procedure: As the child is sitting in an upright position before
the table, the cube is placed on the table within easy view of him.
The cube may be tapped on the table or moved about to attract
the child’s attention.

d observes the cube. His eyes must 
e examiner has removed his hand. 
ke sure that it is the cube and not 


From P. Cattell, The Measurement of Intelligence of Infants and
Young Children, Psychological Corporation. (By permission.)




FIG. 9.3. Picks Up Spoon. Age 5 
Months. 

Material: Teaspoon. 

Procedure: The spoon is placed 
directly in front of the child (sit- 
ting position) within easy reach. 

Scoring: Credit is given if the 
child makes a definite effort to 
reach for and pick up the spoon
and succeeds, but if the spoon is 
picked up by reflex closure of the 
hand on chance contact, it is not credited. Accurate reaching, however, is 
not to be expected at this age. 

From P. Cattell, The Measurement of Intelligence of Infants and
Young Children, Psychological Corporation. (By permission.) 


FIG. 9.4. Places Round Block in Form Board. Age 16 Months.

Material: The form board is similar to Gesell’s. It is made of a
three-eighths-inch board 36 x 16 cm., stained dark green. Three
holes are cut in the board equidistant from each other and from
the edges. From left to right the holes are a circle 8.7 cm. in
diameter; an equilateral triangle, with sides 9.3 cm., and a square
with sides 7.5 cm. The inserts are made of wood 2 cm. thick and
painted [illegible]; the circle is [illegible] cm. in diameter, the
sides of the triangle 9 cm., and those of the square 7.3 cm.

Procedure: The form board is placed before the child with the circle
on his left and the base of the triangle toward him. The circle is
placed in its recess and the child is [illegible] out, then he is
asked (with [illegible]) “Now put it back.”

Scoring: Credit is given if the child replaces the round block. If it is
done with an evidently purposeful act, one trial is enough, but if there
is some doubt as to whether or not it was a chance replacement, no credit
should be given unless it is placed a second time. (Credit is given for
replacing the round block in the reversed board at eighteen months.)

From P. Cattell, The Measurement of Intelligence of Infants and
Young Children, Psychological Corporation. (By permission.)


Scoring. The method of scoring is the same as that used with the
Stanford-Binet scale. Each item is scored as either plus or minus; no
partial credits are given. Since there are five items at each age level,
the credit given for each item passed is one-fifth of the interval
covered by the particular series of tests. Thus, in a series of items
spanning a one month interval, each item passed carries credit of .2 of a
month; when the interval is two months, the credit per item is .4 of a
month; with a three month interval, it is .6 of a month. The Cattell
scale, like the Stanford-Binet, uses a basal age and sums up the
credits at higher levels to obtain a mental age, and from that an IQ.
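
The crediting rule lends itself to a brief worked sketch. The function names and the sample record below are hypothetical, and the IQ is taken to be the conventional ratio of mental age to chronological age multiplied by 100, which the text implies but does not write out.

```python
ITEMS_PER_LEVEL = 5  # five items at each age level, as stated in the text


def mental_age(basal_age_months, levels_above_basal):
    """Basal age plus credits earned at higher levels.

    levels_above_basal: list of (interval_in_months, items_passed) pairs,
    one pair per age level above the basal age.
    """
    credit = sum(
        (interval / ITEMS_PER_LEVEL) * passed
        for interval, passed in levels_above_basal
    )
    return basal_age_months + credit


def ratio_iq(mental_age_months, chronological_age_months):
    """Conventional ratio IQ (assumed here): 100 x MA / CA."""
    return 100 * mental_age_months / chronological_age_months


# Hypothetical infant, 10 months old, basal age 10 months:
# 3 items passed at the 11-month level and 2 at the 12-month level
# (one-month intervals), 1 item at the 14-month level (two-month interval).
ma = mental_age(10, [(1, 3), (1, 2), (2, 1)])
print(round(ma, 1), round(ratio_iq(ma, 10)))
```

Here the credits are .6, .4, and .4 of a month, so the mental age comes to about 11.4 months.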


Validity and Reliability. Although percent passing each item at suc- 
cessive ages was used as evidence of validity, the principal criterion 
was the correlation between Cattell scale IQ ratings, obtained up to 
the age of thirty months, and Stanford-Binet (1937) IQ’s obtained 


TABLE 39

Validity Coefficients: Cattell and Stanford-Binet Scales ¹⁸

No.            Ages at Examinations      Coefficients
42             3 mos. and 36 mos.        .10 ± .10
49             6 mos. and 36 mos.        .34 ± .08
44             9 mos. and 36 mos.        .18 ± .10
[illegible]    12 mos. and 36 mos.       .56 ± .06
52             18 mos. and 36 mos.       .67 ± .05
52             24 mos. and 36 mos.       .71 ± .05
42             30 mos. and 36 mos.       .83 ± .03


with the same children at the age of thirty-six months. Table 39 shows 
the validity coefficients. 
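
The text does not say what the ± figures in Table 39 represent. They are consistent with the classical probable error of a correlation coefficient, 0.6745(1 − r²)/√N, and the sketch below checks that reading against rows whose N is legible. The interpretation is an inference from the figures, not a statement by Cattell, and the function name is illustrative.

```python
import math


def probable_error_r(r, n):
    """Classical probable error of a correlation coefficient r based on n cases."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


# (N, r) pairs read from Table 39.
rows = [(42, 0.10), (52, 0.67), (42, 0.83)]
for n, r in rows:
    print(n, r, round(probable_error_r(r, n), 2))
```

On this reading a coefficient such as .10 ± .10 is no larger than its own probable error, which agrees with the remark that the coefficients for the earliest ages are practically negligible.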

It is obvious that, accepting the Stanford-Binet as the criterion, the 
coefficients are practically negligible for tests given during the first 
nine months of life. In this respect they are much the same as other 
current scales. For the later ages, up to thirty months, the coefficients 
increase very appreciably and are on the whole superior to those 
found with most other scales designed for these age levels. 

In spite of the low predictive value of the coefficients at the earlier 
ages, Cattell has found, from study of individual cases, that the tests 
may be of considerable assistance to the clinician in appraising infants 
who are marked deviants from the norm. This is the case especially 
with infants who get a high quotient; for, Cattell reports, they have 


18 From Psyche Cattell, The Measurement of Intelligence of Infants and
Young Children, New York: Psychological Corporation, 1940, p. 49. (By
permission.)




appreciably better than average chances of earning a high rating at 
the age of two or three years. 

Reliability of the scale was calculated by the odd-even procedure 
and corrected by the Spearman-Brown formula. Coefficients ranged 
from a low of .56 ± .05 at the age of three months to a high of
.90 ± .01 at eighteen months. The median coefficient was .86 ± .02.
These coefficients compare favorably with those found for other 
scales. 
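
The odd-even procedure with the Spearman-Brown correction can be sketched in a few lines. The item records below are invented for illustration, and the helper names are not Cattell's; the correction itself is the standard formula for a test of doubled length, 2r/(1 + r).

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimated reliability of the whole test from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per child, in test order."""
    odd = [sum(row[0::2]) for row in item_scores]    # 1st, 3rd, 5th ... items
    even = [sum(row[1::2]) for row in item_scores]   # 2nd, 4th, 6th ... items
    return spearman_brown(pearson_r(odd, even))


# Invented plus (1) and minus (0) records for four children on six items.
records = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
]
print(round(odd_even_reliability(records), 2))
```

The correction always raises a positive half-test correlation, since each half is only half as long as the scale whose reliability is wanted.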


MERRILL-PALMER SCALE OF MENTAL TESTS ¹⁹


Although the norms of this scale are based upon 631 cases 
ranging in age from eighteen to seventy-seven months, its author does 
not recommend its use with children below twenty-four months or 
above sixty-three months of age. 

The scale consists of ninety-three items arranged in order of diffi- 
culty. There is no attempt to group the tests according to types of 
function or behavior involved. The age norm (called “age at par”) 
for each item is given, this being the age at which fifty percent of the 
children were successful. Although there are ninety-three items, there 
are only thirty-eight different items. Some (twenty-one) recur several 
times, at different age levels; at later ages a higher level of perform- 
ance is required (in terms of quality or quantity of response, or in 
rate of activity) if credit is to be earned. These are called “variable-
score” tests by the author. Other items (seventeen) occur only once
in the scale; they are called “all-or-none” tests. 

The scale tests some language (e.g., the action-agent tests “What runs?”
and “What scratches?”, and simple questions like “What does a doggie
say?”); manipulation of the body
(e.g., opposition of thumb and fingers, crossing feet); motor skills
and coordination (e.g., throwing a ball, buttoning); visual insights
(e.g., building with blocks, copying a circle and a cross, completing
form boards and picture puzzles); and recognizing familiar objects
and colors.


3 Rachel Stutsman, Mental Measurement of Preschool Children, Yonkers,
N. Y.: World Book, 1931.

Scoring. A point or raw score is obtained first. This may be converted
into one or more of several relative indexes: namely, (1) mental
age, (2) standard deviation on the basis of point scores, (3)
standard deviation on the basis of mental ages, (4) standard deviation
on the basis of IQ’s, and (5) percentile ranks on the basis of
raw scores. It is suggested, however, that use of the IQ itself is inadvisable
in connection with the Merrill-Palmer scale because its IQ
deviations are not the same or close enough in size at the various age
levels.
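Two of these relative indexes, the standard-deviation score and the percentile rank, can be illustrated with a short sketch. The norm group below is hypothetical, and the mid-rank treatment of tied scores is an assumption made here, not a rule taken from Stutsman's manual.

```python
from statistics import mean, pstdev


def standard_score(raw, norm_scores):
    """Deviation of a raw score from the norm-group mean, in S.D. units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)


def percentile_rank(raw, norm_scores):
    """Percent of the norm group below the score, counting ties as half."""
    below = sum(s < raw for s in norm_scores)
    equal = sum(s == raw for s in norm_scores)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


# Hypothetical point scores of a norm group at one age level.
norm_group = [10, 12, 14, 16, 18]
print(round(standard_score(18, norm_group), 2))
print(percentile_rank(18, norm_group))
```

Both indexes place a child relative to others of the same age, which is why they remain usable even where the IQ deviations differ in size from one age level to another.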

Stutsman provides and suggests the use of a “Guide for Personality 
Observations” in connection with this scale. While these observations 
do not affect the scoring, they are, nevertheless, useful to the clinician 
in interpreting a child’s responses during the examination. The follow- 
ing traits are observed and rated, as they are manifested during test- 
ing: self-reliance, self-criticism, irritability toward failure, degree of 
praise needed for effective work, initiative and independence of ac- 
tion, self-consciousness, spontaneity and repression, imaginative tend- 
encies, reaction type (slow and deliberate, calm and alert, quick and 
impetuous), speech development, dependence on parent, and other 
observations. Value of these observations and ratings, obviously, will 
be dependent on the skill and experience of the examiner. These and
similar observations, as already pointed out, are desirable, in fact essential,
in the complete report on and evaluation of any individual’s
test performance.


Validity and Reliability. Criteria of validity were those generally 
used: (1) known groups, (2) ratings by nursery school staff, (3) 
small overlapping of distribution of total scores between age groups, 
(4) correlation with chronological age (r = .92 ± .004), and (5)
correlation with the Stanford-Binet (r = .79 ± .019; 159 children
in the standardization group, between 3 and 6 years of age). The 
correlation coefficient for the last criterion must be interpreted in the 
light of the fact that the age range was three years. 

In the guide describing the Merrill-Palmer scale and its standardization,
no data on its reliability are provided. Subsequent studies, …
may infer its reliability. … (ages 2 to 5 years)
…, she found a correlation coefficient of .92 between the scores of the first
and second tests.²⁰


EVALUATION OF SCALES FOR INFANTS AND PRESCHOOL 
CHILDREN 


Technical Problems. Time or speed scores should not be used 
in tests for these age levels. The measurement of rate of performance 
is inadvisable and can be misleading, for at least two reasons: (1) 
speed of performance has not yet become a motivating factor in very 
young children; (2) the shifting attention of children at these age 
levels can obscure their true levels of skill and insight. 

The grouping of test items according to types of activity, as in the 
Gesell schedule, has the advantage of readily indicating functions 
that are retarded and those that are accelerated in the case of the 
child being examined. While this kind of analysis is not so immediately 
apparent in an age scale, like Cattell’s, it is nevertheless possible. 

Since most of the usual validity criteria are not available in stand- 
ardizing tests for infants and young children, this technical problem 
can be solved only through longitudinal studies, following up the 
same individuals over a considerable span of years, and correlating 
early test performance with later acceptable criteria of validity. Some 
efforts have been made in this direction.²¹

Determination of item and subtest reliability, within a scale, pre- 
sents difficulties too. If the odd-even method is used, the results can 
be affected by the fluctuating attention of the subject. If the test-retest 
method is used, the results can be affected by the irregularity of 


20 B. L. Wellman, The Intelligence of Preschool Children as Measured by
the Merrill-Palmer Scale of Performance Tests, University of Iowa Studies in
Child Welfare, 1938.

21 V. L. Nelson and T. W. Richards, “Studies in Mental Development:
I. Performance on Gesell Items at Six Months and Its Predictive Value for
Performance on Mental Tests at Two and Three Years,” Journal of Genetic
Psychology, Vol. 52, 1938, pp. 303-325; II. “Analysis of Abilities Tested at the
Age of Six Months by the Gesell Schedule,” ibid., pp. 327-331; III. “Performance
of Twelve-Months-Old Children on the Gesell Schedule and Its Predictive
Value for Mental Status at Two and Three Years,” ibid., Vol. 54, 1939, pp.
181-191; “Abilities of Infants During the First Eighteen Months,” ibid., Vol.
55, 1939, pp. 299-318. Also, N. Bayley, “Consistency and Variability in the
Growth of Intelligence from Birth to Eighteen Years,” ibid., Vol. 75, 1949, pp.
165-196.




growth tempos, when the time interval is significant. Significance of 
the time interval varies with the age of the subject: the younger the 
child, the shorter the significant interval. The desirable procedure 
would be to make retests within a week. 

Although most scales for these early age levels extend upward be- 
yond the two and three year levels, many psychologists recommend 
the use of the Stanford-Binet from age two, because of its more ade- 
quate standardization. Since the Cattell scale is an extension down- 
ward of the Stanford-Binet, and since it overlaps with the latter, it is 
a sound alternate for the Stanford-Binet. The Merrill-Palmer has also 
been found to be quite useful to the age of three or three and one- 
half years. 


Uses. Psychological tests for children at these ages are used for two 
main purposes: (1) to determine a child’s developmental status, with 
respect to the functions being evaluated, at the time of examination; 
and (2) to predict, so far as possible, future developmental and in- 
tellectual status. Most psychologists agree that the first purpose is 
reasonably well satisfied. With regard to the second of these purposes, 
the infant scales, with one possible exception, used with subjects be- 
low the age of fifteen or eighteen months have not proved adequate; 
for when test ratings obtained in about the first year and a half of life 
have been correlated with subsequent ratings obtained with the Stan- 
ford-Binet and other scales, the correlations found were so low as to 
be negligible. In fact, even an occasional small negative coefficient 
has been found. Therefore, when a child is examined within the first 
eighteen months of life, for purposes of predicting his future mental 
development and status, little weight can be placed upon the numeri- 
cal rating, except in the cases of infants who deviate markedly from 
the average, in either direction. 

Cattell’s scale is superior to most others in this respect. The first 
three coefficients (.10 to .34) in the table showing validity coefficients 
of this scale are characteristic of those generally found for the first 
year of life. But the remaining coefficients of validity are higher than 
those found for other preschool tests. In general, predictive value of 
scales for preschool children increases after the age of eighteen 
months or two years. While the many correlational studies published 
on preschool groups aged eighteen months or more show coefficients 
varying over a considerable range, many of them fall in the .40’s, 




.50’s, and .60’s, with relatively few others higher or lower.²² In general,
the higher the preschool age at the initial test, the higher will be
the relationship of initial score to scores on retests. 

In one instance, at least, psychologists have been concentrating 
upon devising scales to test very young infants exclusively—between 
the ages of 4 and 36 weeks—when adoptions are made most frequently.
These scales, known as The Northwestern Infant Intelligence
Tests, emphasize the child’s adaptation to the physical and social environment.
Their ultimate value, too, will depend upon their demonstrated
predictive efficiency.²³

In spite of their low predictive value for very young infants, avail- 
able scales are of assistance to an experienced clinical psychologist 
in appraising a child's behavioral and mental development when at- 
tention is given to analysis of performance on the various parts rather 
than to numerical scores alone, and when the analysis is used in con- 
junction with other clinical data. Developmental and intelligence tests 
for preschool children must be used with more than ordinary precau- 
tion; for the value of the findings is exceptionally dependent upon the 
skill of the examiner in eliciting the child’s best efforts and in being 
able to appraise his general behavior during the examination session. 

There are several reasons why results of tests in infancy and the 
earlier preschool periods do not have more value in predicting future 
mental status. First, resistance to examiner, shyness, failure to exercise 
maximum effort, and other emotional conditions are undoubtedly
operative in some instances. More important and fundamental, however,
is the fact that there are changes and irregularities in the tempo
of development of numerous young children. It has been found that 
successive examinations of individual infants show fluctuations between
two or three levels; or they may show a trend downward
or upward before leveling off to variations within a relatively
narrow range of ratings.²⁴ These fluctuations and trends in rate of
development may be due to changes in mental organization; that is, 


22 For these studies, see K. M. Maurer, Intellectual Status at Maturity as a
Criterion for Selecting Items in Preschool Tests, Minneapolis: University of
Minnesota Press, 1946, Chapter 2.

23 By A. R. Gilliland and …, Northwestern University, Evanston, Ill.

24 See Cattell, op. cit., pp. 52 ff.; also N. Bayley, “Mental Growth in Young
Children,” Thirty-ninth Yearbook, National Society for the Study of Education,
Bloomington, Ill.: Public School Publishing Co., 1940, Part II, pp. 11-47.




differences in the age of appearance of various functions and
in their rates of development—appearance of new functions and
changes in rates being especially rapid in the first two years of life.

Closely allied to the foregoing is the fact that tests included in in- 
fant scales are dissimilar to those used at later age levels, so that 
little correlation is to be expected. The reader will have observed 
that tests used in appraising an infant’s development in the first eight- 
een months of life are largely of relatively simple motor activities and 
of sensory perception. These have never been found to correlate sig- 
nificantly with tests used at later age levels, which increasingly involve 
the higher and more complex mental functions. (See the chapter on
definitions and nature of intelligence.) It may be that psychologists 
will not be able to devise infancy tests having greater predictive value; 
for it seems that those functions which are subsumed under the term 
“intelligence” do not reach a measurable magnitude until an age later
than infancy. That is, intelligence, as psychologically understood and
defined, does not emerge sufficiently during the earliest phase of development.

In conclusion, then, it may be said that when tests are used with
children below the age of eighteen months, emphasis should be placed
upon an analysis of performances and an evaluation of the child’s
present status. Thereafter, the scales increase in value for the purpose
of predicting later mental level. Bayley,²⁵ after surveying her own and
other researches, concluded that tests given between two and four
years of age will predict eight- and nine-year intelligence test performance
with moderate success (r = .55); while tests given at four
years of age will predict eight- and nine-year performance much
more satisfactorily (r = .75). These conclusions, however, were written
before the publication of Cattell’s Scale and its validity data. It
appears, therefore, that, as Cattell has shown, it is possible to devise
scales of significantly higher predictive value for use with preschool
children who are above eighteen months of age.²⁶

25 N. Bayley, op. cit., pp. 16 ff.

26 It would be highly desirable to have more psychologists and institutions
devote more time, energy, and money to research on the very important problems
in this area, and perhaps to divert some of the time, energy, and money
now being lavished on the study of personality and projective methods.


10.


NONVERBAL GROUP SCALES OF 
MENTAL ABILITY 


BEGINNINGS 


The original Binet scale and its several revisions are adminis- 
tered to one person at a time; hence they are called individual scales. 
This is true, also, of the performance scales already described. Individual
scales, obviously, are time-consuming and require that the
examiner be highly skilled in administering them, in interpreting
responses, and in evaluating the subject’s behavior during the course
of the examination. Impelled, perhaps, by the prevailing urge for
“efficiency” and mass production, and by a desire to investigate large-scale
problems, American psychologists undertook to develop tests
which could be administered to a group of persons—large or small 
—all at one time. 

World War I provided the occasion for the organization of the first 
group test. Prior to 1917, psychologists had been experimenting with 
test items and organization with a view to group examination. Shortly 
after the United States entered World War I, a psychological branch 
was formed in the army in order to develop and use group scales for
the purpose of general classification of soldiers on the basis of mental
ability, so far as the tests might measure that trait. A few other devices
were developed for use in the army—e.g., trade tests—but in the army
of World War I the work and contribution of psychologists were very 
largely in the testing of general ability. 

Although the 1916 Stanford Revision and the Yerkes Point Scale 




were employed to some extent, as well as an individual performance 
scale, the main task in the army was one of testing very large num- 
bers of men in a short space of time. Consequently, the Army Alpha 
scale (verbal) and the Army Beta scale (nonverbal) were organized, 
both being group scales. These were actually the product of the con- 
tributions of individual psychologists, notably Arthur S. Otis, who 
pooled their experience, experimental results, and resources. 

About 1,750,000 men were tested in the army of World War I. 
Though the scales were by no means highly satisfactory instruments, 
and though the men were very often examined under unfavorable 
conditions, the results obtained were of some assistance in the selec- 
tion of men for advanced or special training, on the one hand, and 
of men of such inferior ability as to be unsuited for military training, 
on the other. 

The use of psychological testing in the army of World War I had 
many outgrowths, some of which were, no doubt, unforeseen by the 
psychologists themselves. The data were reported and analyzed in a 
huge volume.¹ On the basis of these data, many periodical articles
and books appeared on such subjects as racial and national differ- 
ences in intelligence, geographic differences in intelligence within 
the United States, differences between occupational groups, relation- 
ship between educational status and intelligence, and the general in- 
tellectual level of the American adult. Not only were many of these 
data of doubtful validity but some of the interpretations and publica- 
tions based upon them gave rise to serious misapprehensions in regard 
to the foregoing problems, which are loaded with social and educa- 
tional implications. Another result of psychological testing in the army 
was the impetus it gave to the development of group tests for civilian 
purposes, notably in educational work at all levels, from kindergarten 
through university. Also, it set a precedent, for in World War II 
psychological testing was conducted on a vast scale in all departments 
of the armed forces. 

The types of test materials included in the army group scales and 
in the numerous group scales subsequently developed were not all 
innovations. For example, tests of memory, sentence completion, free 
and controlled word association, arithmetic computation, vocabulary, 
classification of objects, and following directions had been in process 


1 Yerkes, R. M., editor, Psychological Examining in the United States Army,
Memoirs of the National Academy of Sciences, Washington: Government
Printing Office, 1921, Vol. 15.




of experimentation beginning within the last twenty-five years of the 
nineteenth century, in the United States and in several European coun- 
tries. 


CHARACTERISTICS OF GROUP TESTS OF MENTAL ABILITY 


With a few exceptions, group tests—implicitly or explicitly— 
are constructed on the principle that intelligence is a general capacity 
and that it should be measured by means of sampling a variety of 
mental activities. Inspection of the scales shows, therefore, that they 
include in various combinations such items as following directions, 
arithmetical problems, practical judgment (in connection with “com- 
mon-sense” problems), word meaning, disarranged sentences, comple- 
tion of number series, completion of sentences, verbal analogies, in- 
formation, mazes, three-dimensional visualization and counting of 
cubes, symbol-digit combinations, picture absurdities, picture arrangement,
geometrical construction (“paper-form-board”), and geometric
pattern analogies. Samples of each of these will be presented later 
when illustrative scales are described. 

In most group scales, the items of each type (e.g., number series) 
are placed together in separate subtests or parts, beginning with the 
easiest and progressing by intervals—as nearly equal as may be 
achieved—to the most difficult. The principle involved here is this: 
by means of such an arrangement of items, every individual for whom 
the test is intended should be able to get some items correct and to 
proceed to a level of difficulty which represents his maximum in that 
particular type of mental activity. 

Occasionally, however, it will be found that items in a scale are 
arranged in “spiral omnibus” fashion: that is, items of various types 
are presented in regular or irregular order instead of being grouped 
separately in subtests. Thus, there may be a sequence of this kind: one
item each in number completion, arithmetical problem, vocabulary, 
information, analogies, etc.; then the types of items, increasing in 
difficulty, will be repeated in the same or in a different order. 

Every group scale is standardized for a specified range of ages or 
school grades.² Thus the particular types of items used and the levels


2 Standardization for a grade range is, in practice, tantamount to standardiza- 
tion for an age range because the scale must be adequate for the spread of ages 
ordinarily located within those grades. Furthermore, even when a scale is specified
for certain school grades, it provides tables of norms for the various age
groups.




of difficulty within a scale will depend upon the group for which it 
is intended. For instance, a group scale designed for children from 
kindergarten through the second grade will be almost entirely non- 
verbal in character, except for directions; one designed for pupils in 
the intermediate grades will include an increasingly larger portion of 
abstract and conceptual items (verbal and numerical); while tests of 
intelligence for high-school pupils and college freshmen are very 
largely, some entirely, of the verbal and numerical kind. 

On many group scales, an individual’s score is first obtained in 
terms of the number of points earned; that is, a raw score. From a 
table of norms, this score is converted into a mental age, from which 
an intelligence quotient is calculated. The manuals of some group 
scales provide, also, tables necessary to find an individual’s percentile 
rank for his age or grade, or both. Other group scales dispense with 
mental ages and intelligence quotients, and give only percentile- 
rank equivalents for the range of point scores. 
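The conversion described above, from raw score to mental age to intelligence quotient, may be sketched as follows. The norms table is hypothetical, and both the linear interpolation between tabled values and the clamping at the ends of the table are assumptions made for illustration; an actual manual supplies its own table and rules. The quotient itself is the usual ratio of mental age to chronological age, multiplied by 100.

```python
from bisect import bisect_right

# Hypothetical norms: (raw point score, mental age in months).
HYPOTHETICAL_NORMS = [(10, 60), (20, 72), (30, 84), (40, 96)]


def mental_age_months(raw, norms=HYPOTHETICAL_NORMS):
    """Mental age for a raw score, interpolating between tabled values."""
    points = [p for p, _ in norms]
    if raw <= points[0]:
        return float(norms[0][1])   # assumption: clamp below the table
    if raw >= points[-1]:
        return float(norms[-1][1])  # assumption: clamp above the table
    i = bisect_right(points, raw)
    (p0, m0), (p1, m1) = norms[i - 1], norms[i]
    return m0 + (m1 - m0) * (raw - p0) / (p1 - p0)


def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: 100 x MA / CA, both ages in the same unit."""
    return 100.0 * mental_age / chronological_age


ma = mental_age_months(25)
print(ma, round(ratio_iq(ma, 72), 1))
```

Scales that report only percentile ranks skip the mental-age step entirely and compare the raw score directly with the distribution for the child's age or grade.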

Group scales are scored more rigidly and more objectively than 
individually administered scales, such as the Stanford-Binet.³ In the
former, the correct response, or responses, are supplied for each item 
so that they can be scored by clerks or machines. In the case of the 
Stanford-Binet and similar scales, while specimens of satisfactory and 
unsatisfactory responses are supplied, it is frequently necessary for the 
examiner himself to evaluate some responses and to decide whether or 
not credit should be given for them. This necessary exercise of judg- 
ment, however, does not invalidate the scales; for correlational studies 
have shown that there is very close agreement between experienced 
examiners as to the scoring of given responses. 

Most group scales impose time limits for each of the several sub- 
tests, or parts. Whether this fact makes a scale a test of speed of re- 
sponse, solely or largely, or whether the scale measures “power” (level 
of difficulty the individual is capable of reaching) is a question 
to which answers have been provided by experiment. The imposition 
of time limits does not necessarily make a scale a test of speed of performance;
the significance of the speed factor, in affecting the total
score of a person, varies with the scale used. 


3 This does not mean that the group scales are therefore better instruments.
In fact, inflexible scoring is a disadvantage if the examiner's purpose is to study 
an individual clinically and to analyze test results qualitatively as well as quan- 
titatively. 




Some group scales are entirely nonverbal in content; others are 
entirely verbal; while still others combine the two types of items. In 
this chapter we shall describe several representative scales of the non- 
verbal variety. 


PINTNER-CUNNINGHAM PRIMARY TEST ⁴


This scale (having alternate forms, A and B) is intended for 
children in kindergarten, grade 1, and the first half of grade 2. It con- 
sists of seven subtests: common observation (identifying objects com- 
monly found in the usual environment); perception of esthetic dif- 
ferences; identification of associated objects (knowledge of relation- 
ships between objects, such as key and lock); discrimination of size; 
perception of the elements that constitute a whole picture; picture 
completion (noting missing parts in a picture and, from a series of 
choices, indicating the missing part); copying designs (using a given 
square of dots).


FIG. 10.1. Items from Pintner-Cunningham Primary Test (panels: Associated
Objects, Picture Completion, Dot Drawing). Copyright 1938 by the World Book
Company. (Reproduced by special permission.)


Mental age norms are provided from 4 years and 1 month to 10 
years. 


4 By R. Pintner, B. V. Cunningham, and W. N. Durost. Published by World
Book Company, 1938.

Validity and Reliability. As in the case of many other group scales,
validity of the Pintner-Cunningham test is given in terms of its correlation
with the Stanford-Binet (1916), the reported coefficients
being … .88.

Reliability is given in terms of correlations between the alternate
forms (A and B) and in terms of the probable error of test scores.
The reliability coefficients vary from .83 to .94. The standard deviation
of scores is reported as being about five times as large as its
probable error; hence this criterion of reliability may be considered
satisfactory, since the minimum acceptable relationship between a
standard deviation and its probable error is usually given as three
to one.
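The relation between a reliability coefficient and the ratio just mentioned can be illustrated numerically. The sketch assumes that the probable error in question is that of an obtained score, taken as 0.6745 times the standard error of measurement (the standard deviation multiplied by the square root of one minus the reliability coefficient); the test manual's own computation is not given in the text, so this is an interpretation rather than a reproduction.

```python
import math

PE_CONSTANT = 0.6745  # probable error = 0.6745 x standard error


def sd_to_pe_ratio(reliability):
    """Ratio of the S.D. of scores to the probable error of a score.

    Assumes PE = 0.6745 * SD * sqrt(1 - r), so the S.D. cancels out.
    """
    return 1.0 / (PE_CONSTANT * math.sqrt(1.0 - reliability))


for r in (0.83, 0.91, 0.94):
    print(r, round(sd_to_pe_ratio(r), 1))
```

Under this assumption every coefficient in the reported range of .83 to .94 yields a ratio above the three-to-one minimum, and a coefficient near .91 corresponds to a ratio of about five to one.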


CHICAGO NONVERBAL EXAMINATION ⁶


This scale is designed for subjects from six years of age through 
adulthood;⁷ but it is very doubtful whether it is at all adequate for a
representative sampling of persons above age thirteen. The authors 
state that the scale has proved clinically useful for children between 
seven and thirteen years of age. In the case of persons above age 
thirteen, it appears that the tests measure speed of performance. 

The types of items included are not unique: symbol-digit; percep- 
tion of similarities and classification of objects (by crossing out the 
one object that is of a different class); three-dimensional visualiza- 
tion and block counting; “paper-form-board” (marking the parts that 
would make up a whole geometric figure); visual perception of detail 
(matching geometric designs which have more or less internal de- 
tail); picture arrangement (numbering parts of wholes, to indicate 
how the parts must be placed to form the whole); numbering pic- 
tures in their correct order to represent a logical sequence of events; 
picture absurdities (noting the missing or superfluous parts); picture 
matching by relating a part to a whole picture; symbol digit (more 
complex and more extensive than the first symbol-digit test). 


Validity and Reliability. The reliability coefficients reported are be- 
tween .80 and .90. Some of these coefficients are within the range 
usually regarded as satisfactory. But they were calculated for age 


5 The probable error of the standard deviation indicates the extent of fluctuations
in scores to be expected as a result of random errors of measurement.

6 By A. W. Brown. Published by The Psychological Corporation, 1936.

7 The scale may be administered by using verbal directions or by means of 
pantomime. Pantomime, however, cannot be used successfully with children 
younger than eight years. 




ranges of two and three years and extend over two to six school 
grades, whereas reliability coefficients for each chronological age 
group and each school grade would be more valuable, if a child’s per- 
formance is to be compared with others of his own age or grade. 


FIG. 10.2. Items from Chicago Non-Verbal Examination (panels: Block
Counting, “Paper Form Board,” Matching Figures, Picture Sequence, Picture
Arrangement). Psychological Corporation. (By permission.)


The criteria of validity used in standardizing this scale were the 
usual ones: (1) correlation with CA, (2) known groups (in this in- 
stance, the mentally retarded), (3) symmetrical bell-shaped distribu- 
tion, and (4) correlation with results of other tests. 

The reported coefficients of correlation for test scores and chronological
age varied from .57 to .81, using groups having an age range
of seven or eight years.

Employing the second criterion, the authors of the scale found an 
average IQ of 61 (S.D. = 12) for 99 feeble-minded children, as com- 
pared with an average of 62 (S.D. = 6) obtained with the Stanford- 
Binet. The mean difference between the Stanford-Binet and the Chi- 
cago Nonverbal ratings, however, was 9.0 points (disregarding signs; 
correlation coefficients were not given). Thus, while the two means 
are practically identical, the standard deviations and the mean of the 
differences are such as to indicate that in many individual instances 
there were significant discrepancies between ratings obtained on the 
two scales. These discrepancies are most probably attributable to the 
differences in content of the scales; and they suggest, further, that the
two scales may be used to supplement each other, rather than as substitutes,
when an individual’s mental abilities are being analyzed and
evaluated.

In respect to the third criterion, the authors report a satisfactory
distribution of scores, closely approximating the symmetrical curve.
Like Terman and many other psychologists, they believe that measured
intelligence should be and is naturally distributed in the form of the
“normal frequency surface”; hence they use that curve as a criterion.

Interestingly enough, excepting the mentally deficient, the results of
the Chicago Nonverbal Examination have been compared with results
of two other group tests that are largely verbal in content (Otis
and Kuhlmann-Anderson), but not with those of the Stanford-Binet
which, by most specialists, is considered to be a most valuable criterion
in standardizing group tests. With chronological age not held
constant, the coefficients fell between .57 and .74; but with CA constant,
the coefficient was only .51. These correlations represent only
moderate correspondence.


REVISED ARMY BETA EXAMINATION ⁸


8 By C. E. Kellogg and N. W. Morton. Published by The Psychological
Corporation, 1935.

The original Army Beta scale was constructed in order to provide
an examination for illiterates and non-English-speaking men in
the Army of World War I. It included the familiar mazes, cube
analysis and counting, symbol-digit combinations, pictorial completion,
and geometric construction. There are two other subtests with
which the reader may not yet be familiar: the first is the X-O series,
which includes a number of arrangements of the letters X and O;
each series has a number of blanks that are to be filled in according
to the arrangement of the given sequence. For example:


X X O X X O X X O X X O (followed by blank spaces to be filled in)


The other is a number-checking test, which consists of a series of 
paired numbers, from short to long. The subject is required to check 
those pairs which are identical. For example: 


650 650 
659012534 659021354 


FIG. 10.3. Items from Revised Army Beta Examination. Psychological
Corporation. (By permission.) For No. 6 the testee marks the “picture
absurdity”; for Nos. 15 and 16 he indicates the “missing parts.”


The Revised Beta Examination, also, is intended to serve as a
measure of general ability in the case of “relatively” illiterate or non-
English-speaking subjects. The types of subtests of the two scales—
original and revised—are not identical; for in the revised scale, the
cube counting and the X-O tests have been omitted, while picture 
absurdities and object similarities and differences have been added. 

Norms are provided in terms of Stanford-Binet (1916) mental age 
equivalents, the range being from MA 6 years and 3 months to 16
years and 8 months. 




Validity and Reliability. The scale’s reliability coefficients were high 
when odd-even items were correlated, r being .987. But the coefficient 
for test-retest scores was .77. 

Validity of the revised Beta scale is given in terms of the correla- 
tion between its point scores and Stanford-Binet (1916) mental ages, 
the coefficient being .78 ± .02. When the revised Beta point scores
were correlated with the Otis Self-Administering Tests of Mental
Ability (higher examination A), the coefficient found was .71 ± .02.
Since these coefficients are based upon results obtained with groups 
of subjects whose ages varied, the correlations must be interpreted 
with due regard for the fact that they would be lower if chronological 
age had been held constant. What this signifies, then, is that this
nonverbal scale may be used to supplement verbal scales; but it should
not be used as a substitute for them.
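
The effect of holding chronological age constant can be illustrated with the standard first-order partial correlation formula. The sketch below is illustrative only: the coefficient of .78 between Beta point scores and Stanford-Binet mental ages is taken from the text, but the two correlations with age (.60 and .65) are hypothetical values assumed for the example and are not reported in the manual.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


r_beta_binet = 0.78  # reported in the text
r_beta_age = 0.60    # hypothetical value, assumed for illustration
r_binet_age = 0.65   # hypothetical value, assumed for illustration

r_partial = partial_correlation(r_beta_binet, r_beta_age, r_binet_age)
print(f"with age varying: {r_beta_binet:.2f}; with age held constant: {r_partial:.2f}")
```

Under these assumed values the coefficient falls noticeably once age is partialled out, which is the direction of change the text describes; the actual size of the drop depends on the true correlations with age.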


PINTNER NONLANGUAGE SERIES: INTERMEDIATE TEST ⁹


This scale, devised for use from grade 4 through grade 9, dif- 
fers from most other group scales of general ability in that it utilizes 
no verbal situations and is independent of word knowledge and lan- 
guage facility except as these are involved in understanding directions. 
The author also provides directions for administering the scale in 
pantomime where this is desirable, as in the case of subjects who suffer 
from language handicaps or from defective hearing. 

The tests consist entirely of materials of a diagrammatic nature, 
intended to provide “relatively independent” measures of the “spa- 
tial factor,” “perceptual ability” (visual), and “reasoning” (without 
use of language). This Pintner scale, therefore, is one of the few 
which specifically utilize some of the “factors” presumably isolated 
by those adhering to the group-factor theory of intelligence. Yet, in- 
terestingly enough, though tables for separate standard score ratings 
are provided for each of the subtests, no attempt is made to indicate 
which of the presumed factors is being measured by means of each 
of the six subtests. In fact, the author states that, “No claim is made 
that the subtests tap primary or independent abilities, and little is 
known as to the significance of the separate subtest scores. . . .
Only large deviations [from the median] should be given any credence
in guidance of individuals.” ¹⁰ Apparently, then, the inference that this


9 By R. Pintner. Published by World Book Company, 1945.
10 Manual of Directions, pages 1 and 9.




scale measures the spatial factor, perceptual ability, and reasoning 
rests entirely on an a priori basis rather than upon scientific analysis. 

The total scores of the scale, however, are regarded as having
validity; for tables of norms, from age 7 to age 17, are provided, from
which mental ages, intelligence quotients, and percentile ranks may be
obtained.


Fig. 10.4. Items from Pintner General Ability Test
—Intermediate. Copyright 1938 by World Book
Company. (Reproduced by special permission.) At
top are Reverse drawings: after looking at the sample
pair to the left of the bar, the subject looks at the
first drawing to the right of the bar and finds its
reverse among the next three drawings. For the right
middle series, Movement sequence, he finds which
of the four drawings at the right completes the series
started at the left of the bar. For the bottom series,
Paper folding, he indicates which of the items at the
right of the bar shows the way the folded paper at
the left of the bar would look if it were opened up.


Validity and Reliability. Reliability of the scale is reported in the 
usual terms of correlation coefficients between scores on odd-num- 
bered items and scores on even-numbered items, these being .85 
(Form K) and .89 (Form L). 

As criteria of validity of his nonlanguage scale, Pintner used in- 
crease of mean scores in successive age-groups, correlation with his 
verbal tests of ability, and similarity (judged by observation) between
his tests and those devised by factorial analysts. The increases in 
mean scores are moderate; and the correlation coefficients (second 
criterion) are in the .60's, for groups whose age range is only one 


year. Although these coefficients indicate a significant amount of
relationship between the verbal and nonverbal tests, they are still far
enough removed from unity to warrant again the conclusion that these
two types of tests may not be used interchangeably but can be used
to provide a fuller understanding of an individual’s abilities,
particularly in those instances where a person is handicapped by language
disability, or where it is desired to discover the existence of a specific
type of disability (e.g., in visual perception and discrimination) of a
nonverbal kind in the case of a person who has no language handicap.


NONLANGUAGE MULTI-MENTAL TEST ¹¹


This scale (in two forms, A and B) employs only a single type
of item throughout. Each item consists of drawings of five objects,
four of which “belong” on the basis of some common relationship,
while the fifth does not “belong.” The testee has to identify and mark
the non-belonging item.


Fig. 10.5. Recognition of Relationships. Items from Non-Language
Multi-Mental Test, Teachers College, Columbia University. (By
permission.)


The authors state that their scale “. . . is constructed to measure 
the ability to recognize and utilize relationships not among verbal 
symbols, but rather among pictorial symbols.” ¹²

Norms are provided for mental-age equivalents and grade equiva- 


11 By E. L. Terman, W. A. McCall, and I. Lorge. Published by Bureau of
Publications, Teachers College, Columbia University, 1942.
12 Manual of Directions, p. 1.




lents; in the former instance from a mental age of 33 months to one 
of 236 months; in the latter instance, from mid-first grade to mid- 
eighth. 


Validity and Reliability. Two common criteria of validity were used: 
correlation with chronological ages (r = .65) and correlation with 
mental ages as derived from a variety of group tests of intelligence 
(r = .67). 

Reliability coefficients of the scales, for a grade range of 3 to 8,
were .86 for Form A, .90 for Form B, and .94 for both forms
combined. Reliability coefficients for each of the several grades taken
separately varied from .66 to .74, for each form taken alone. When
both forms were combined, estimated reliability was .80.
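
The relation between the single-form and combined-form coefficients is roughly what the Spearman-Brown formula predicts when a test is doubled in length. The sketch below applies that formula to the single-form figures quoted above. It is offered as an assumption about how such "estimated" combined reliabilities are commonly obtained, not as the procedure documented in the manual.

```python
def spearman_brown(reliability, length_factor=2.0):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Single-form coefficients quoted in the text; labels are for display only.
single_form = [
    ("Form A, grades 3 to 8", 0.86),
    ("Form B, grades 3 to 8", 0.90),
    ("one form, single grade (lowest)", 0.66),
    ("one form, single grade (highest)", 0.74),
]

for label, r in single_form:
    print(f"{label}: {r:.2f} alone, about {spearman_brown(r):.2f} when doubled")
```

The predicted values for the full grade range bracket the reported .94, and the prediction from the lowest single-grade coefficient is close to the reported .80.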


PATTERN PERCEPTION TEST ¹³

This test, like the one immediately preceding, employs a sin- 
gle type of material. But it is much more complex, its solutions require 
more subtle perceptions and reasoning, and it is intended for use only 
with adults. 

The scale includes sixty-four items (or problems). Each item con- 
sists of a row of five designs, the problem being to discern the four 
designs which form a pattern and to cross out the extra or inappro- 
priate one. There are eight sets of items, each set placed in the order 
of increasing difficulty. In each set of items the problems begin with 
an elementary presentation of a theme, or pattern, which is developed 
with increasing complexity in subsequent items. 

Although the Pattern Perception Test is still in process of standardi- 
zation and development, it is described here because it represents a 
type of nonverbal material that may prove to be very valuable for use 


13 Prepared under the direction of L. S. Penrose. Published by Galton Lab- 
oratory, University of London, 1947. This type of test was originally planned 
by Penrose and Raven; later Raven standardized a form principally for indi- 
vidual testing. During World War II, a short form for group use was prepared 
by Raven for use in the British army. See L. S. Penrose, “An Economical 
Method of Presenting Matrix Intelligence Tests,” British Journal of Medical 
Psychology, Vol. 20, Part 2, 1944, pp. 144-146. For background of the present 
test, see L. S. Penrose and J. C. Raven, “A New Series of Perceptual Tests, 
Preliminary Communication,” ibid., Vol. 16, 1936, pp. 97-104; J. C. Raven, 
“The R.E.C.I. Series of Perceptual Tests: an Experimental Survey,” ibid., Vol. 
18, 1939, pp. 16-34; P. E. Vernon, “Research on Personnel Selection in the 
Royal Navy and the British Army,” The American Psychologist, Vol. 2, 1947,
pp. 35-51.


Fig. 10.6. Items from Penrose Pattern Perception Test. Galton
Laboratory, University of London. (By permission.)


with a wide range of adult ability. This test has been found, for ex- 
ample, to have value in identifying men at both the lower and the 
upper ends of the distribution of mental ability. It may prove to be 
valuable, also, in examining psychotic persons (or those suspected of 




being psychotic); for reports indicate that these persons perform rela- 
tively poorly on items requiring judgment (insight) and constructive 
thinking. 

Statistical data thus far reported show test-retest reliabilities to vary 
between approximately .80 and .90. Validity coefficients determined 
by correlating the test scores with job ratings in the British navy and 
army varied widely, due to the degrees of reliability of rating criteria 
and the effects of selectivity in various jobs. The mean validity co- 
efficients for each of a number of navy and army branches varied from 
.30 to .47. Correlations with other standardized tests range from a low
of .43, for 67 medical students (a fairly homogeneous group) to .73, 
for a random sample of 597 men in the British army. 
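
The dependence of these validity coefficients on the reliability of the rating criteria can be illustrated with the classical correction for attenuation. The sketch below is hypothetical in its inputs: the observed coefficients (.30 and .47) and a test reliability of .85 (within the reported range of .80 to .90) come from the text, but the criterion reliabilities of .50 and .80 are assumed for illustration, since the reliabilities of the job ratings are not given here.

```python
import math


def correct_for_attenuation(r_observed, test_reliability, criterion_reliability):
    """Estimated correlation if both measures were perfectly reliable."""
    return r_observed / math.sqrt(test_reliability * criterion_reliability)


test_reliability = 0.85  # within the .80 to .90 range reported in the text

# (observed validity from the text, hypothetical criterion reliability)
cases = [(0.30, 0.50), (0.47, 0.80)]

for r_observed, criterion_reliability in cases:
    corrected = correct_for_attenuation(r_observed, test_reliability, criterion_reliability)
    print(f"observed {r_observed:.2f}, criterion reliability {criterion_reliability:.2f}: "
          f"corrected about {corrected:.2f}")
```

The less reliable the ratings, the more the observed coefficient understates the underlying relationship, which is one reason coefficients from different service branches are hard to compare directly.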


PROGRESSIVE MATRICES TESTS ¹⁴

These are nonverbal scales designed to evaluate the subject’s
ability to apprehend relationships between geometric figures and
designs, to perceive the structure of the matrix and of the figure (part)
necessary to complete each system of relations (the matrix) presented.
The tests, thus, are intended to evaluate the person’s ability to
discern and utilize a logical relationship presented by these nonverbal
materials. The problems require, in varying degrees, analytical and
integrating operations of the kind called “insight through visual survey.”
Verbalization and abstraction of relationships are also possible
factors, if the subject is able to analyze and synthesize by these means.
Factorial analysis suggests that the matrices tests are measures largely
of a “general factor,” with a small loading of a spatial perception
factor. Raven, the author of these tests, interprets this factor as being
essentially the same as Spearman’s eduction of relations and eduction
of correlates.

There are several sets and editions of the matrices tests, each
intended for a specified age group or for limited ability levels. Thus, one
set is for children between the ages of 3 and 10, and for mental
defectives; another may be used with children from age 6 and with
adults; a third is devised for use with only the highest quarter of the
population.


14 Prepared by J. C. Raven. Published by H. K. Lewis & Co., Ltd., London.
See, by Raven, “Standardization of Progressive Matrices, 1938,” British Journal
of Medical Psychology, Vol. 19, Part I, 1941, pp. 137-150; “Progressive
Matrices,” published by The Crichton Royal, Dumfries, Scotland (1938-1951).
Also, G. Keir, “The Progressive Matrices as Applied to School Children,”
British Journal of Psychology (Statistical Section), Vol. 2, 1949, pp. 140-150;
G. A. Foulds and J. C. Raven, “An Experimental Survey with Progressive
Matrices (1947),” British Journal of Educational Psychology, Vol. 20, 1950,
pp. 104-110.


Fig. 10.7. Specimen Items from Raven Progressive
Matrices Test. (By permission.)


Validity and Reliability. While a fair amount of research has been 
carried out regarding reliability and validity, much more remains to 
be done in these respects. Reliability has been found to differ at the 
several age levels; and while validity has been suggested in terms of 
a general factor, the tests’ predictive efficiency (that is, their practical
validity) needs further investigation. Even so, this type of instrument
is a very valuable addition to available nonverbal tests, particularly 
since it shows considerable promise for use with adults at the superior 
as well as average levels. Progressive matrices, as a type of test mate- 
rial, and the pattern perception type also, merit intensive experimental 
study, in themselves, and experimental comparison with the types of 
materials commonly included in nonverbal tests developed and widely 
used in the United States. At the present time, both of these instru- 
ments have progressed sufficiently to warrant their use as valuable 
supplements to the scales currently employed in both nonclinical and 
clinical situations; for these tests have been found useful with and 
interesting to individuals of a wide variety of ages, ability levels, and 
degrees of stability (and instability). 


CATTELL CULTURE-FREE TEST 


This is an attempt to provide a measure of general mental 
ability free from verbal materials and from “the acquired skills of 




most performance tests.” ¹⁵ It consists of six parts. The first, a
classifications test, requires the subject to identify in each row of six figures
the two that do not belong with the others. In the second, called 
“pool reflections,” the subject identifies the one of six drawings which 
represents the specimen drawing as it would appear in a pool image. 
This is a test of spatial perception. The third is a completion test in 
which the testee identifies from among six specimens the one draw- 
ing that will complete a series of four members. Completions vary 
from simple matching to the eduction of correlates. The fourth, fifth, 
and sixth parts are called matrices. In these, the subject has to identify 
the last member of a series of four or nine parts, the patterns being of 
increasing complexity. Some of the sequences are horizontal; some 
are vertical; some involve rhythms or cycles. The last of the subtests, 
the sixth, is further complicated by the fact that several sections of the 
matrix are missing, thus making it more difficult to discern the pattern. 


Validity and Reliability. Reliability is indicated by a split-half co- 
efficient of .88 (corrected by the Spearman-Brown formula), obtained 
with a group of 121 high-school freshmen. 
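
The odd-even procedure behind a corrected split-half coefficient can be sketched as follows. The response matrix is hypothetical (six examinees, eight items scored 1 or 0) and is far smaller than the group of 121 described above; it serves only to show the two steps: correlate the half-test scores, then step the result up to full length with the Spearman-Brown formula.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Return (half-test correlation, Spearman-Brown corrected coefficient)."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_half, even_half)
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Hypothetical responses, one row per examinee (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(f"half-test r = {r_half:.2f}; corrected to full length = {r_full:.2f}")
```

The corrected coefficient is always the larger of the two when the half-test correlation is positive, because it estimates the reliability of a test twice as long as either half.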

To begin with, this scale was standardized on about one hundred 
boys in a junior vocational high school and the same number in an 
academic senior high school. In retaining or rejecting an item, the cri- 
terion was its ability to discriminate between the two groups, the 
academic school pupils being regarded as the more able. (Note the 
assumption here that vocational school pupils are mentally less able 
than the other group. The differentials, presumably, are not regarded 
as due to the age differences.) The retained items were further 
screened on the basis of responses obtained from college students and 
from pupils in grades seven and eight. Finally, an analysis of items was 
made on the basis of highest-scoring and lowest-scoring individuals in 
a group of two hundred students doing major work in psychology. 

Although this is called a test of general ability, there is a low com- 
munality of functions measured, as judged from the intercorrelations 
of the parts. These run from .12 to .63, with a median of .38. The 
correlations of each of the part scores with total scores varied from
.55 to .82, the median being .74. These coefficients, however, are
spuriously high because the score of each subtest was included in


15 By R. B. Cattell. Published by The Psychological Corporation, 1944. 


Fig. 10.8. Items from Cattell Culture-Free Test: Classification and
Matrices. Psychological Corporation. (By permission.)


the total score; the coefficients, therefore, are in part indicative of 
self-correlation. 
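
The inflation produced by including a subtest in its own criterion can be shown by correlating each part first with the total score and then with the total of the remaining parts. The scores below are hypothetical (five examinees, three subtests with names chosen for illustration) and are not taken from the Cattell data.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical subtest scores for five examinees.
parts = {
    "classification": [4, 6, 5, 9, 7],
    "series": [3, 7, 4, 6, 8],
    "matrices": [5, 4, 6, 8, 9],
}

totals = [sum(scores) for scores in zip(*parts.values())]

results = {}
for name, scores in parts.items():
    rest = [total - score for total, score in zip(totals, scores)]
    r_with_total = pearson(scores, totals)  # part is included in the total
    r_with_rest = pearson(scores, rest)     # part is excluded from the criterion
    results[name] = (r_with_total, r_with_rest)
    print(f"{name}: with total {r_with_total:.2f}, with remaining parts {r_with_rest:.2f}")
```

In this small example every part correlates more highly with the total than with the sum of the other parts; the difference is the self-correlation component the text refers to.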

When results of the Culture-Free Test were correlated with scores 
on the Modified Alpha Examination, Form 9, the median coeffi- 
cients for four groups (N = 376) were about .50 with Alpha verbal 
and numerical scores taken separately, and about .55 with the total 
Alpha scores. The Culture-Free Test was also correlated with the 
Stanford-Binet, the American Council, and the Arthur scales, and 
with the Ferguson Form Board results.¹⁶ The mean coefficient with
these was .52. 

The foregoing coefficients would be considered too low as evidence 
of validity, if the other intelligence tests were taken as basic criteria. 
Cattell and his collaborators apparently do not do so; they offer these 
data, it seems, to indicate the extent of agreement between their test 
and the others. Basically, however, their criterion is one of factor 
validity based upon acceptance of the general-factor theory and upon 
factor analysis of their data. Their view is essentially this: Devise a 
sampling of tests which have been accepted by psychologists as test- 
ing intelligence—in this instance nonverbal perception of relationships 
and spatial perception; that is, those tests demanding insight into com- 
plex relations, ability to learn, ability to think abstractly, and ability to 
cope with new situations. Analyze the test results to find if they yield 
a general factor. This was done; statistical evidence of a general factor 
was found; and Cattell concluded that he and his collaborators were 
measuring general ability.

None of the pragmatic and usual criteria of validity were applied, 
excepting correlations with other tests, but to these not much sig- 
nificance was attached. This Culture-Free Test should be regarded as 
a tentative and experimental device until its validity is demonstrated
in terms other than factorial analysis.


GOODENOUGH DRAWING TEST ¹⁷

This is a test which purports to evaluate a child’s intelligence 
by means of his drawing of a man. It is intended for ages 3½ years to
13½. The child is instructed to make a picture of a man as best he


16 R. B. Cattell, S. N. Feingold, and S. B. Sarason, “A Culture-Free Test of
Intelligence: I. Evaluation of Cultural Influence on Test Performance,” Journal
of Educational Psychology, Vol. 32, 1941, pp. 81-100.
17 F. L. Goodenough, Measurement of Intelligence by Drawings, Yonkers,
N. Y.: World Book, 1926.




can. He is told to work carefully and to take his time. Scoring is based 
not upon esthetic quality but, rather, upon the presence of essential 
details which presumably indicate the individual's level of perceptual 
differentiation of an object that is very familiar in his environment. 

Although the reliability coefficients are often satisfactory, ranging 
from about .75 to about .90, the drawing test does not correlate well 
with the other types of tests which have been found most useful and 
promising in the measurement and evaluation of general ability. Yet, 
this test is another instance of a device which, in spite of its doubtful 
general validity as a test of intelligence, has been found useful as an 
adjunct to verbal tests when mental deficiency is suspected in the case 
of a given child. 


DAVIS-EELLS TEST OF GENERAL INTELLIGENCE ¹⁸


Rationale. This test of “problem solving ability” is an im- 
portant innovation in the testing of intelligence. It is unusual in re- 
spect to its content and the rationale thereof, and in respect to its 
frank rejection of statistical validation with other tests. In place of 
such validation, the authors substituted intensive interviews with 
children to disclose the “mental problems of a kind found in most of 
the basic areas of children’s lives: school, home, play, stories, and 
work. The specific problems resulted from intensive observation and 
detailed interviewing of children in many areas of activity. . . . The 
extent to which the items in the test deal with problem situations 
which seem real to children of the age levels for which the test is 
planned may best be judged by examining the test with this criterion 
in mind.” The test items selected for initial tryouts and experimenta- 
tion were based upon the “insights of a number of educators and 
sociologists familiar with the characteristic modes of living and child 
upbringing at different socio-economic levels, and in part upon sys- 
tematic observation of children in free-time activities (on playgrounds, 
in neighborhood groups, in schools, etc.).” ¹⁹


Contents. The instrument finally emerged after extensive research 
and six tryouts with large numbers of children, from widely different 
socio-economic backgrounds, in several sections of the United States. 
It is believed to be culturally fair to all socio-economic groups in 
urban areas. 


18 By Allison Davis and Kenneth Eells. Published by World Book Co., 1953.
19 Manual, pp. 7 and 18. 




Forms for two levels are available: Primary, for grades 1 and 2; 
Elementary, for grades 3 through 6. The items finally selected are 88 
in number, as follows: 


Best-Way items (29). In each item, three pictures show the be- 
ginnings of attempts to solve a stated problem or perform a given 


Fig. 10.9. The task is to select the one statement that best
explains the situation shown in the picture.

1. The boys want to wash the man’s window and sidewalk.
2. The man is making the boys wash his window and sidewalk.
3. You can’t tell from this picture why the boys are washing
the window and sidewalk.

From Davis and Eells Test of General Intelligence. World
Book Co. (By permission.)


task. The subject indicates the picture that will lead to the best
solution of the problem.

Probabilities items (29). Each picture shows a situation in which
certain elements are present. The child has to select, from three
choices, the most probable explanation of what is happening in the
picture.




Picture Analogy items (22). This is the familiar type of item in 
which a relationship between two objects is shown, and the subject is 
required to find a similar relationship in a given set of pictures. 


Fig. 10.10. The task is to select the picture in which a given sum of
money can be made, beginning with the right-hand side of the dotted
line and completing from the left side. In this problem, the sum to be
made is 40 cents. From Davis and Eells, op. cit. World Book Co. (By
permission.)


Money items (8). In each item, two sets of coins are shown in
three different combinations. Each combination is incomplete. The
problem is to discern the appropriate combination, of the three, that
will yield a stated sum when completed.


Fig. 10.11. The task is to place the bottles in the black box in such a
way that the white box may best be placed on the black one. From
Davis and Eells, op. cit. World Book Co. (By permission.)


Of these, 26 are in the Primary form only, 41 are in the Elementary 
only, and 21 are common to both. 


Validity and Reliability. Although Davis and Eells do not base the 
validity of their tests upon correlations with other and earlier scales, 
they do present, for informational purposes, some correlations with 




the Otis Quick-Scoring tests, obtained for groups in single grades, 
3 through 6. Of the sixteen coefficients, seven are in the .50’s; the
remainder are rather evenly distributed from .39 (lowest) to .66
(highest). The authors believe these coefficients are what should be 
expected, since they indicate that the abilities measured by their tests 
bear a substantial relationship to those measured by the other tests; 
yet their tests and the others do not measure altogether the same 
factors. 

Correlations with standardized school achievement tests are also
reported. These are:


with reading: .43
with arithmetic: .41
with language: .40
with spelling: .24


These coefficients are significantly lower than those found with the
more usual types of individual and group scales; but this is to be
expected because of the very nature of the Davis-Eells tests, and since
the authors’ position is that several significant factors other than
problem-solving ability contribute to success in school achievement.

Split-half reliability coefficients range as shown below:


grade 1: .68 
grade 2: .82 
grade 3: .84 
grade 4: .83 
grade 5: .82 
grade 6: .81 


A coefficient of .68 is much too low for predictive purposes. The re- 
maining indexes are moderate, but not as high as is optimally desirable. 
Test-retest reliabilities (two-week interval) are: 


grade 2: .72 
grade 4: .90 


On the whole, these coefficients, if typical, indicate that the present
test is not so reliable for refined differentiation as it is for
differentiating between levels; for example, very superior, superior,
average, inferior, very inferior.
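
What these reliabilities imply for individual scores can be illustrated with the standard error of measurement, SEM = SD × √(1 − r). The sketch below uses the split-half coefficients listed above; the standard deviation of 15 score points is a hypothetical value assumed only to make the arithmetic concrete and is not taken from the Davis-Eells manual.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement for a score scale with the given SD."""
    return sd * math.sqrt(1 - reliability)


ASSUMED_SD = 15.0  # hypothetical standard deviation of the score scale

# Split-half reliability coefficients by grade, as listed in the text.
split_half_by_grade = {1: 0.68, 2: 0.82, 3: 0.84, 4: 0.83, 5: 0.82, 6: 0.81}

for grade, r in split_half_by_grade.items():
    sem = standard_error_of_measurement(ASSUMED_SD, r)
    print(f"grade {grade}: r = {r:.2f}, SEM about {sem:.1f} points")
```

Under this assumed scale, a reliability of .68 leaves an error band of roughly eight to nine points around an obtained score, against about six points at .84. Errors of that size blur fine distinctions between neighboring scores while still permitting sorting into broad levels.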




In connection with the validity and general characteristics of these 
tests, several aspects should be pointed out. The authors aimed to de- 
velop an instrument that would present situations and problems in 
forms that are within the experiences common to all children in the 
groups for whom they were planned. This meant the elimination of 
language (except in the directions) and other cultural factors that 
might favor one group and handicap another. While nonverbal tests 
are not new, the kinds of situations presented are novel. Furthermore, 
the psychological functions to be sampled by these and similar prob- 
lem-situations were determined and specified after psychological inter- 
views and psychological analysis, rather than by factorial analysis. 


Evaluation. The foregoing difference in procedure is very significant 
with regard to principles of test construction. Factorial analysts de- 
vise a large number of test items and subtests on the basis of certain 
assumed mental processes involved; the scores on these are
intercorrelated and further analyzed in order to find common functions, to
combine and reduce the number of subtest categories, and to name
them. Davis and Eells, on the other hand, first ascertained the kinds
and range of problems children deal with; then by interview and
analysis of children’s responses they determined that certain psycho- 
logical functions were operative and had differentiating significance. 
The functions they specify are association, insight, reasoning, and 
organizational ability (method of attacking a mental problem). After 
the items for these tests were developed, each of a group of children 
was interviewed to determine whether the problems evoked the men- 
tal processes which the tests seek to measure. It was found that 92 
percent of the children who answered analogy problems correctly 
also explained the analogy-relationship correctly. And nearly all pupils 
explained the relationships correctly in solving the other types of 
subtest problems. These findings strongly indicate that the test prob- 
lems evoked in the successful subjects the mental processes which the 
authors intended should be utilized. 

Like all other new tests, the Davis-Eells will have to be subjected
to investigation to determine their predictive efficiency. Since the
authors point out that school achievement is dependent only in part
upon problem-solving ability (which is to be measured by their tests),
it is to be expected that predictive efficiency will not be very high, in
terms of correlations, with school grades. It will be necessary, then, to




use other selected criteria of demonstrated problem-solving ability; 
or to utilize pupils in whom this ability is the only significant variable, 
while the others are practically constant. 


EVALUATION OF NONVERBAL GROUP SCALES 


Uses. A survey of available nonverbal scales shows, for the 
most part, that they are valuable with children who have had limited 
educational opportunities or impoverished social backgrounds, with 
young children who have not yet learned to read, with older pupils 
who are handicapped by reading or language difficulties, and with 
illiterate or non-English-speaking adults. Possible exceptions to this 
statement of limited usefulness are the Pattern Perception, the Pro- 
gressive Matrices, and the Davis-Eells Tests. 

Nonverbal tests are valuable, also, for the better diagnosis of cases 
who, on verbal tests, have intelligence quotients between about 60 
and 75, and who, therefore, would be considered as subjects for 
special educational treatment or possibly institutional care. The ex- 
amining psychologist might be in doubt with regard to such border- 
line cases; but if the results of the nonverbal tests confirm those of 
the verbal, he has reason to allay his doubts. However, if the rating on 
the nonverbal tests is significantly higher, then the case will require 
further study to account for the discrepancy. 

Nonverbal tests can be clinically useful, also, with individuals whose 
intelligence quotients are higher than 75; that is, for individuals who, 
on verbal tests, appear to be significantly less capable than there is 
reason to believe they actually are, on the basis of other information 
about them. In this connection, they are particularly useful in popu- 
lation centers having large numbers of non-English-speaking homes. 

In whatever situation nonverbal tests are used, the examiner must 
realize that defective vision or slow psychomotor responses can be a 
handicap. The first of these handicaps points up the importance of 
clear drawings—a condition that is not always satisfied. 

Tests of mental ability have had their greatest usefulness in schools, 
where they have been utilized for purposes of educational and voca- 
tional guidance, as well as in the diagnosis of learning difficulties in 
the case of a particular individual. Nonverbal group tests have been 
found valuable in efforts to determine aptitude and promise in shop 
work, mechanical drawing, architectural drafting, and occupations of 
a mechanical or quasi-mechanical nature—all of which make demands 




upon those psychological activities which enter into problems in- 
volving geometric perceptions and reasoning with the concrete rather 
than with the abstract. 

For these purposes, nonverbal group scales may not be used in- 
discriminately, for the following reasons. Some are measures prin- 
cipally of detailed visual perception (Revised Army Beta); some at- 
tempt to span too wide an age range and are not able to make any but 
rather gross differentiations (Non-language Multi-Mental Test, which 
is scaled for ages 33 months to 236 months); at the higher age levels,
some are less suitable for women than for men because of specialized
content.


Validity. Studies of the validity of nonverbal scales show that while 
most of them correlate significantly with scales of the verbal type (in- 
dividual and group), the coefficients are far enough removed from 
unity to warrant using the two types as supplements rather than as 
equivalents. When scores on verbal and on nonverbal scales are cor- 
related, for children in the earlier grades (approximately through 
grade six), the coefficients obtained are usually in the .60’s and .70’s, 
with relatively few in the .80’s. But when the subjects tested are
pupils in the later grades, the coefficients usually fall in the .50’s
and .40’s, with a few lower and a few higher. These generally lower
coefficients, in the case of pupils in the later grades, are due to the
inability of most available nonverbal tests to discriminate between
individuals in the upper levels of ability.
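
One conventional way of seeing why coefficients of this size argue for
supplementary rather than interchangeable use is to square them, the
squared coefficient being read as the proportion of variance the two
measures have in common. The following is only an illustrative sketch;
the function name is ours, and the values are round figures of the sizes
mentioned above, not results from any particular study.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    return r * r

# Round coefficients of the sizes mentioned in the text (illustrative only).
for r in (0.80, 0.70, 0.60, 0.50, 0.40):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.2f}")
```

Even a coefficient of .70 leaves about half of the variance of one scale
unaccounted for by the other.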


Functions Measured. The reader will recall that most, though not all, 
authors of nonverbal tests of mental ability seek to measure the same 
mental processes as those tested by means of verbal scales. Some of 
these authors are unequivocal in maintaining that the nonverbal tests 
require essentially the same type of intelligent performance as that 
required by the abstract symbols of language and number. They hold 
that the problems presented in diagrams, pictures, charts, and
geometric forms closely parallel those presented by means of language
and number. For example: picture arrangement is regarded as being
similar in function to disarranged sentences; picture analogies similar
to word analogies; picture completion similar to sentence completion;
reasoning with geometric patterns similar to reasoning with numbers
and words; perceiving similarities, differences, and part-whole
relationships in pictures and patterns similar to such relationships
in language. Many nonverbal tests, however, suffer from the fact
that they attempt to assess general ability by means of homogeneous 
test items or by means of a very limited number of subtests. 

Related to the question of measuring general ability by means of a 
more or less homogeneous nonverbal scale, it is important to note that
it is quite possible to find some sort of general factor—which is actu- 
ally a statistical and theoretical concept—in a series of subtests such 
as those in Cattell’s and others, but the statistical derivation of a gen- 
eral factor does not in itself give proof or assurance that the tests are 
necessarily measuring the functions being attributed to them. For ex- 
ample, a series of tests might be devised in which a “general factor” 
emerges, and which upon psychological analysis proves to be a factor 
of speed of work. In another instance a general factor of visual acuity 
may emerge, etc. Fundamentally, in any statistical analysis of test 
data, the factor or factors that emerge will depend upon the initial 
characteristics of the tests themselves. In establishing validity of a 
scale, therefore, it is inescapably necessary to analyze the
psychological processes involved in the tests and to compare the results they
yield with forms of activity and evidences of ability which are indica- 
tive of intelligent behavior in our culture. 

The significant coefficients of correlation found between verbal and 
nonverbal tests of intelligence demonstrate that there is merit in the 
view that the two types are, in a significant degree, measuring the
same or associated functions. But this does not mean that verbal and 
nonverbal tests are equivalent; for one type may also involve certain 
functions not involved in the other, or one may demand a higher level 
of the same functions being tested than does the other. 

Language and number are symbolic systems which represent some- 
thing else: e.g., objects, qualities, events, actions. Development of 
abilities in language and number facilitates intelligent behavior; for 
the use of these symbols expands the individual’s range of experience 
beyond the limits of the immediate situation. Development of lan- 
guage and number makes possible a finer discernment of forms and 
objects in the world surrounding the individual; for with the use of
language and number he is enabled to analyze, synthesize, classify, 
and organize his perceptions. Objects and events, at first vague, are 
more sharply defined; likenesses and differences are accentuated; 
evaluations are refined. Language and number enable individuals
also to organize their thinking into larger and more comprehensive
unified patterns. 

It is because the use of language and number requires that the in- 
dividual go beyond the immediate concrete situation and because 
he thereby can engage in more complex and subtle mental operations 
that many psychologists regard ability to deal with symbols as a 
higher form of intellectual activity than the ability to deal with con- 
crete objects. They therefore prefer to test intelligence, whenever pos- 
sible and appropriate, by means of verbal and numerical materials. 
They would, however, use nonverbal tests, when these are made
necessary by developmental immaturity, language, or cultural handicap,
to gain the insights that these tests provide if they are adequately
scaled in difficulty.


Cultural Influences. The emphasis upon verbal and quantitative
aspects of intelligence in many of the individual and group scales has
given rise to a misapprehension regarding the nonverbal scales:
namely, that these latter are “culture free”; and, in fact, one test
author (Cattell) has so named his scale. Inspection of the items in this
scale and others reveals that they utilize many objects that children
and older persons learn about through experiences in their
environments. These experiences are dependent upon a culture just as
development of verbal and quantitative abilities is. The differences are
matters of degree of cultural influence and universality or
near-universality of experience. Consequently, it is preferable to speak of
“culture fair” tests rather than “culture free” tests, in connection with
tests which utilize materials that do not handicap or favor any segment
of the population for whom the test is intended.

The presence of cultural influence in a test that appears to be “cul- 
ture free” was demonstrated in a study made with several tribes of 
North American Indians.” The Goodenough Draw-a-Man Test was 
used. In a group of Hopi Indians, the mean IQ for boys was 123, while
for girls it was 102. Zias also showed appreciable differences in favor 
of boys, whereas in a group of Navahos, the means of boys and of 
girls were very nearly equal (107 and 110). The sex differences or 
similarities within each tribe are attributed to sex differences or
similarities in training and experience within each culture. Boys and
girls are trained to observe different aspects and details of their
environment and are taught different types of drawing. The two sexes
have different functions in their group; these functions are reflected in
differentiated training; the differences in training are reflected in
differences in performance.

R. J. Havighurst et al., “Environment and the Draw-a-Man Test: The
Performance of Indian Children,” Journal of Abnormal and Social
Psychology, Vol. 41, 1946, pp. 50-63.

Since every person must develop in an environment of some kind, 
his skills, information, repertory of responses, modes of thinking, etc., 
are to some extent culturally conditioned. Some psychological tests 
are more “culture fair” than others. At this point we recall again 
Binet’s principle that a test of intelligence should be consonant with 
the milieu of those who are to be measured by it. 


11.


VERBAL AND MIXED GROUP
SCALES OF MENTAL ABILITY


THE scales presented in this chapter are either entirely or predomi- 
nantly verbal in content. The proportion of nonverbal materials varies 
in different scales; but it will be noted that even in those instruments
that contain appreciable portions, symbolic materials (language and
number) predominate.


Since there are many scales that come within this classification, it is
neither possible nor necessary to describe and analyze all of them. It is
the purpose of this chapter to present a sufficient number of repre- 
sentative scales so that the student can know their characteristics and 
content, their quality, advantages and disadvantages, their similarities 
and differences. The details of content are presented in order to pro- 
vide the student with a clearer conception of the psychological proc- 
esses being tested than would otherwise be the case. The statistical de- 
tails for each test will enable the student to see clearly the techniques 
of standardization used and the degree of success achieved, upon 
which evaluations of these instruments must rest. 


CALIFORNIA TESTS OF MENTAL MATURITY ¹

Contents. The revised series provides scales on five levels:
preprimary (kindergarten and entering grade 1); primary (grades
1-3); elementary (grades 4-8); intermediate (grades 7-10 and
adult); advanced (grade 9 and adult). All of these are designed to
test the same “mental factors”; hence, since the series covers the wide
age range indicated, it is essential that the content of each scale be
adapted to its particular level, as regards difficulty and form. Thus at
the earliest levels there must be emphasis upon nonverbal materials,
with minimum requirements made upon word knowledge and number
concepts. The later levels increase in their demands upon word
knowledge, number concepts, and reading of and reasoning with numerical
and verbal materials, while increasingly complex nonverbal materials
are also retained.

1 By E. T. Sullivan, W. W. Clark, and E. W. Tiegs. Published by California
Test Bureau, Los Angeles, 1951.

We shall describe only the elementary scale (grades 4-8), since 
it is representative of the entire series. 

The elementary scale consists of twelve subtests, grouped under five 
headings, or factors, as follows: 


Memory:
1. Immediate recall: series of words pronounced in pairs; then
only the first word of each pair is repeated, and the subject is to
recall the second word of the pair.

2. Delayed recall: a story is read to the subjects. Thirty minutes
later they are given a series of multiple choice items to test
the extent to which details of the story are recalled.


Spatial relationships:
3. Sensing right and left: 20 pictures of hands and feet in
various positions. The task is to discriminate between right and left.

4. Manipulation of areas: spatial patterns of a variety of forms
and in different positions to be manipulated, to test spatial imagery.


Logical reasoning:
5. Opposites: 15 sets of drawings, showing five objects in each
set. The first is the usual “given” object; the testee selects from the
other four the one that is the opposite of the first.

6. Similarities: the well-known classification test, using 15 sets
of drawings. The first three in each set are alike in some respect.
The task is to select the similar item in the remaining four drawings.

7. Analogies: the familiar test of relationships, using drawings.
The first two stand in some relationship (a hat and a man’s head);
a third item is given (a shoe). The task is to select a fourth item
that bears the same relationship to the third as the first does to the
second.

8. Inference: a major and a minor premise are given. The task
is to select a logical conclusion from among several alternatives.


Numerical reasoning:
9. Number series: each series of numbers increases or decreases
according to a principle. The testee has to discern that principle.

10. Numerical quantity: number concepts using coins. The subject
indicates how many coins of each denomination are required
to make up a specified sum.

11. Numerical quantity: arithmetical problems.


Verbal concepts:
12. Word similarities: 100 given words. In each item, the task
is to select one word, from four, that is synonymous, or nearly so,
with the given word.


This scale also provides three optional subtests: of visual acuity, 
auditory acuity, and motor coordination. These are not included in 
the scoring. They may be used to learn whether, in respect to these 
functions, the subject will or will not be handicapped on the subtests 
that are scored. The use of these three preliminary tests can be very 
helpful in identifying persons who would be under a disadvantage in 
taking a group test, and who should, therefore, be examined
individually. When an individual scale is used, existing handicaps can be
overcome, in part at least, and can be taken into account in the
interpretation of the testee’s performance and score.


Scoring. An aspect of these scales worthy of special note is the
method of scoring. The score is obtained for each of the divisions
(e.g., immediate recall, delayed recall) under each of the factors
(memory, etc.). For each separate score, a mental-age rank is found
and plotted on a scale, or profile. The separate scores of each
division are then added to give the score for the particular factor (in this
instance, memory), which is also plotted on the profile. The scores
of all factors are then added to yield the total score, from which the
usual indices may be derived. Also, the appropriate subtest scores are
added to obtain a rating on language factors (subtests 2, 8, 11, 12);
then the scores on the remaining subtests are added to get a rating
on nonlanguage factors.

Thus these scales enable the examiner to obtain: (1) ratings on the 
several subtests; (2) separate mental ages and intelligence quotients
for verbal subtests combined, nonverbal subtests combined, and total 
score of all subtests. This type of scoring and profile permits ready 
analysis of the subject’s weaknesses and strengths, consistencies and 
inconsistencies, in the types of mental operations being tested, as- 
suming, of course, that the test materials are valid and the reliability 
is high. 
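
The additive structure of this scoring scheme may be sketched as
follows. This is an illustration only: the raw scores are invented, the
names are ours, and the language group is taken to be subtests 2, 8, 11,
and 12, which is an assumption about which subtests are verbal. The
conversion of each score to a mental-age rank requires the publisher's
norm tables and is not attempted here.

```python
# Subtest numbers grouped under the five factors of the elementary scale.
FACTORS = {
    "memory": [1, 2],
    "spatial relationships": [3, 4],
    "logical reasoning": [5, 6, 7, 8],
    "numerical reasoning": [9, 10, 11],
    "verbal concepts": [12],
}

# Assumption: the language rating is built from subtests 2, 8, 11, and 12.
LANGUAGE = {2, 8, 11, 12}

def summarize(raw: dict) -> dict:
    """Add subtest raw scores into factor, language, nonlanguage, and total scores."""
    out = {name: sum(raw[i] for i in subtests) for name, subtests in FACTORS.items()}
    out["language"] = sum(score for i, score in raw.items() if i in LANGUAGE)
    out["nonlanguage"] = sum(score for i, score in raw.items() if i not in LANGUAGE)
    out["total"] = sum(raw.values())
    return out

# Hypothetical raw scores for one pupil, keyed by subtest number.
pupil = {1: 12, 2: 9, 3: 15, 4: 10, 5: 11, 6: 12,
         7: 10, 8: 8, 9: 7, 10: 9, 11: 6, 12: 40}
profile = summarize(pupil)
print(profile)
```

Each entry of the resulting profile would then be looked up separately in
the norms, which is what permits the comparison of strengths and
weaknesses described above.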


Validity and Reliability. Although the norms for this scale are “. . .
based on a controlled (stratified) sampling of over 125,000 cases,” ²
reliability statistics were based on only 725 pupils in grades
4 to 6 “. . . in representative school districts.” The split-half method
(presumably odd-even) was used, the coefficients having been
corrected by the Spearman-Brown formula. The results were:


“total mental factors”: coefficient = .95;
standard error of measurement, 3.5 IQ points.
“language factors”: coefficient = .94;
standard error of measurement, 3.9 IQ points.
“non-language factors”: coefficient = .92;
standard error of measurement, 4.5 IQ points.
separate factors: coefficients range from
.87 (spatial relationships) to .92 (memory).
Standard errors of measurement in the subtests range
from 4.5 to 5.8 months of mental age.
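
The two formulas involved here may be sketched as follows. The
Spearman-Brown correction steps a half-test correlation up to an
estimate for the full-length test, and the standard error of measurement
is the standard deviation multiplied by the square root of one minus the
reliability. The function names are ours, and the standard deviation of
16 IQ points is an assumption made for illustration; the manual's own
computations are not reproduced.

```python
import math

def spearman_brown(r_half: float) -> float:
    """Estimate full-test reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Assumed S.D. of 16 IQ points; reliabilities as reported above.
for label, r in (("total", 0.95), ("language", 0.94), ("non-language", 0.92)):
    sem = standard_error_of_measurement(16.0, r)
    print(f"{label}: reliability {r:.2f}, SEM about {sem:.2f} IQ points")
```

Under that assumption the computed values fall close to the reported
3.5, 3.9, and 4.5 IQ points.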


On the whole, these coefficients fall within the range of quite satis- 
factory correlations, particularly for the two major divisions and for 
the total score. The standard errors of measurement, in terms of IQ
points, also compare favorably with those of the sounder tests.
Although the errors of measurement for the subtests are somewhat larger,
this is what would be expected. Once again, greater reliability of wide 
rather than narrow sampling of performance is demonstrated. 

It is in respect to the accepted validity criteria that the authors of
these California scales do not provide adequate data.³ The authors’
purpose, to begin with, is to measure most of the kinds of mental
processes sampled by the Binet scales, based upon an analysis of the
“conceptual framework” of the Binet; but no correlational data with
the Stanford-Binet are given in the manual. Unpublished data,
provided by the publishers of these scales, show the following results
(Table 40) obtained with the Elementary scale.

2 Manual, p. 22. The socio-economic distribution of the sampling is not
reported in the manual.
3 In a mimeographed prospectus kindly furnished by the tests’ authors, there
are extensive data dealing with the 1947 [...]. This prospectus reports
significant correlations [...] other group scales. For the “language
factors” [...] coefficients are largely in the .70’s [...] are in the .50’s.
Correlation coefficients with school grades vary, as they do
with all tests of intelligence, and are higher for the “language factors.” For
these factors the correlations are, for the most part, in the .50’s, .60’s, and .70’s.
Similar data are not provided for the elementary scale described herein.


TABLE 40
Correlations between California Elementary Scale and S-B

      California           Stanford-Binet
  Md. IQ    S.D. IQ      Md. IQ    S.D. IQ       N       r
  107.3      28.5        108.2      29.3        283     .94
  122.0      27.9        128.5      28.8        182     .93
   79.2      22.7         79.5      21.6        101     .90


Several significant facts and inferences emerge from the foregoing
table. First, there is close correspondence between the two scales in
respect to medians and standard deviations of intelligence quotients.
Second, the groups are not representative of the population for which
the scale is intended, since the medians are significantly above or
below 100, and the S.D.’s are very much larger than would be found in
an unselected population (namely, about 16). Third, due to the very
wide range of ability, as represented by the very large standard
deviations, the probability is that the correlation coefficients are
significantly higher than they would be for a population of narrower and
typical range. Fourth, it would be desirable to determine this aspect of
validity separately for each age group, since close correspondence at one
age level or for combined age levels does not necessarily assure equally
close agreement at other age levels.

The manual furnishes data on the intercorrelations of the subtests,
based on 1048 cases in grades 4 to 6. In view of the authors’ previous
preference for the group factor theory of intelligence (stated in earlier
editions of the scales) and of the generally low intercorrelations of the
present subtests, we may conclude that these coefficients are regarded
as one evidence of the scale’s validity: namely, that the subtests have
relatively little in common regarding measured factors; hence they
satisfy one requirement of the accepted theory.⁴
The subtest intercorrelations may be summarized as follows: 


range of subtest intercorrelations: .25 (spatial relationships and
numerical reasoning) to .60 (memory and verbal concepts)
seventy percent of the coefficients are below .50; fifty percent are
below .40
range of correlations of subtests with language subtests: .35-.95
(in part self-correlations)
range of correlations of subtests with nonlanguage subtests: .55-.78
(in part self-correlations)
range of subtest correlations with total scores: .60-.86 (in part
self-correlations)


Regarding validity, one may draw some inferences, however, from
data provided in other connections. (1) Median IQ for the entire
population sample is 100, with a standard deviation of 16. (2) The
population samplings at the higher educational levels show a
progressive increase in medians and decrease in standard deviations, until
at the college sophomore level these are 114.5 and 13.5, respectively,
and at college graduate level they are 124 and 12. Such progressive
increases are to be expected, since advancing educational levels are
more or less selective as regards intelligence.
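
The bearing of the standard deviation upon the size of a correlation
coefficient, noted earlier in connection with Table 40, can be given a
rough numerical form. The sketch below uses the usual adjustment for a
change in the spread of one variable, which assumes a linear relation
and equal scatter about the regression line at all levels. It is our
illustration, not a computation reported by the authors of the scale,
and the function name is hypothetical.

```python
import math

def correlation_at_new_spread(r_observed: float, sd_observed: float,
                              sd_target: float) -> float:
    """Estimate the correlation in a group whose S.D. differs from the observed one.

    Assumes linearity and equal scatter about the regression line
    (the usual range-restriction assumptions).
    """
    u = sd_target / sd_observed
    return (r_observed * u) / math.sqrt(1 - r_observed**2 + (r_observed**2) * (u**2))

# First row of Table 40: r = .94 in a group whose California S.D. is 28.5.
# Target: a population with the typical S.D. of 16.
estimate = correlation_at_new_spread(0.94, 28.5, 16.0)
print(round(estimate, 2))
```

On these assumptions a coefficient of .94 obtained in so heterogeneous a
group would correspond to one in the neighborhood of .84 in a population
of typical range, which is still substantial but appreciably lower.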

Even when a scale is validated within a theoretical conception and
framework, it is still necessary to subject the scale to other validating
procedures to determine if it actually serves the purposes for which it
is intended. In other words, does the scale “work”? What is its
predictive efficiency? The necessity of answering these questions is the
reason why all tests of intelligence should be validated against
accepted criteria of intelligent behavior. Just devising test materials that
satisfy a theoretical conception gives no assurance of the scale’s
practical and predictive validity.


4 The reader will recall that the original Binet scales and the Stanford-Binet
are based upon a general factor theory.

In the test manual, page 4, it is stated that “The total mental factors score
has been found by the authors and other investigators to correlate as high or
higher with the individual Stanford-Binet than any other mental ability test.”
No data relevant to this aspect are given in the manual; but the several
correlations reported above were provided in a personal communication.


TERMAN-McNEMAR TEST OF MENTAL ABILITY ⁵


Contents. This scale (in two equivalent forms) is intended 
for use primarily in grades 7 through 12, though norms are provided 
from the age of 10 years through 19 years, 11 months. The scale con- 
sists of seven subtests: information, synonyms, logical selection, classi- 
fication, analogies, opposites, and best answer. 


Items from
Terman-McNemar Test of Mental Ability
(World Book Co. By permission)

Information:
Polo is a kind of
(1) disease (2) work (3) bear (4) game (5) language

Synonyms:
Comic: (1) clumsy (2) laughable (3) universal (4) tricky (5) peculiar

Logical selection:
An orchestra always has
(1) violinists (2) piano (3) musicians (4) saxophone (5) singers

Classification:
(1) Catholic (2) Methodist (3) Presbyterian (4) Republican (5) Baptist

Analogies:
Zoo is to animal as aquarium is to:
(1) birds (2) fish (3) bees (4) statues (5) butterflies

Opposites:
Exit: (1) emit (2) transcend (3) entrance (4) origin (5) arrival

Best answer:
The saying, “Idle brains are the devil’s workhouse,” means
(1) The devil is lazy.
(2) People who are idle get into trouble.
(3) Many hands make light work.
(4) The devil works with his brains.


5 By L. M. Terman and Q. McNemar. Published by World Book Company,
1942. This scale is a revision of the Terman Group Test of Mental Ability
which was published in 1920.


The content of the scale is quite homogeneous in that it is entirely 
verbal in character. The scale is thus consistent with Terman’s defini- 
tion of intelligence: that is, the ability to deal with symbols and ab- 
stractions. 

The authors subscribe to the general factor (g) theory of intelli- 
gence. They hold that the general factor is best tested by means of 
materials using symbols and abstractions. In order to achieve a high 
degree of homogeneity in test materials, they even omitted from their 
scale arithmetical and numerical types of subtests, which are widely 
regarded as being very good tests of the general factor. The authors 
state the reason for their selection of materials as follows: “More 
homogeneous material has been used in order to have a test more 
highly saturated with a common factor or ability. Thus, the exclusion 
of arithmetical and numerical subtests means that the scores of any 
two individuals are more nearly comparable qualitatively; i.e., they 
lie along the same continuum. This continuum may be characterized 
as general verbal intelligence.” ⁶ The usefulness of this scale,
therefore, is restricted to subjects who are not laboring under a language
handicap and to situations wherein “verbal intelligence” is required
as the sole or major ability.


Validity and Reliability. Reliability of this scale is presented in terms 
of three familiar methods: the split-half method (correlation of scores 
on odd-numbered and even-numbered items), the interform method 
(correlation of scores on the two forms given to the same subjects), 
and the probable error of measurement. The split-half reliability co- 
efficient was .96 (279 cases, grades 7 through 9), while the interform 
reliability coefficient was .95 (239 cases, grades 7 and 9). When an 
age range of only one year was taken (13-6 to 14-5), both coefficients 
were .96. The probable error of measurement was found to be 2.2
standard score points. (In terms of the scoring units of the scale, this
probable error can be considered small.)
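
The probable error of measurement is conventionally the standard error
of measurement multiplied by .6745, the normal-curve deviate that bounds
the middle fifty percent of a distribution. A minimal sketch follows,
assuming the scale's standard deviation of 16 standard-score points and
a reliability of .96; whether the manual computed its figure in exactly
this way is not stated, and the function name is ours.

```python
import math

def probable_error_of_measurement(sd: float, reliability: float) -> float:
    """PE = 0.6745 * SD * sqrt(1 - reliability)."""
    # 0.6745 is the normal deviate that bounds the middle 50 percent of cases.
    return 0.6745 * sd * math.sqrt(1 - reliability)

pe = probable_error_of_measurement(sd=16.0, reliability=0.96)
print(round(pe, 2))
```

The result, a little under 2.2 points, agrees closely with the figure
reported.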

Validity of individual items was determined primarily on the basis
of percent of pupils passing each item in the successive grades; in
other words, the extent to which each item differentiates between
groups at different levels of maturity. A second criterion of item
validity was the correlation of each item with total scores.⁷ No item
was retained if it yielded an average coefficient of less than [...]. In
fact, ninety percent gave correlations of .40 or higher, with an
average of .53. Thus, for validity, the final selection of items was made
on the basis of item difficulty and degree of relationship of item
performance to total scale performance.

6 Manual of Directions, p. 1.
7 For this purpose, the “tetrachoric correlation” was used. This technique
differs from the more familiar one used in correlating two sets of variable
scores in that the underlying assumptions are different; but interpretation of
obtained coefficients is approximately the same. For an explanation of the
tetrachoric method, see any standard textbook in statistics.


Scoring. Two statistical factors are at the base of the rating plan
used: (1) a single origin; and (2) comparable units in all parts of the
scale. The reader is already familiar with the fact that raw scores do
not provide comparable units throughout a scale. For the
Terman-McNemar scale, this problem is met by a type of standard-score scale
which uses the median of the 14-year age group of the national
standardization population as the origin, and the standard deviation of this
age group, arbitrarily made 16 points, as the unit of measurement.
“Scores on this scale for all age groups are thus measured from a
single origin and provide comparable units throughout all parts of this
scale.” ⁸

8 Manual of Directions, p. 4.

In other words, the authors of the scale have devised a method of
scaling which is a variant of the familiar standard score. The 14-year
age group is taken as the standard; norms of other age groups are
given in terms of standard scores of the 14-year group. Thus, for age
14, the median raw score was 76; this was arbitrarily called a standard
score of 100. The raw score standard deviation (actually about 27)
was assigned an arbitrary value of 16. No reason is cited for the
choice of this particular number. It may be noted, however, that 16
was the most typical value of the standard deviation for the IQ’s of
the 1937 revision of the Stanford-Binet [...] number to use in a group
scale which attempts to measure the same type of ability. Having
assigned this arbitrary value, the authors then [...] a mean of 100 and
S.D. of 16. [...] of the 13-year group (13 years, [...] with a standard
deviation of 16 [...] a standard score value of 93, which then becomes
the norm for the age 13 years and 0 months.
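
The scaling just described may be approximated by a simple linear
conversion that takes the raw-score median of the 14-year group (76) as
origin and its standard deviation (about 27) as unit. The sketch is an
approximation only: the published scale is read from tables, which may
depart from a straight line, and the function and constant names are
ours.

```python
ORIGIN_RAW_MEDIAN = 76.0  # median raw score of the 14-year group (from the text)
RAW_SD = 27.0             # raw-score standard deviation, "actually about 27"
SCALE_MEAN = 100.0        # standard score assigned to the origin
SCALE_SD = 16.0           # arbitrary unit assigned to one raw standard deviation

def standard_score(raw: float) -> float:
    """Linear approximation to the standard-score scale; the real scale uses tables."""
    z = (raw - ORIGIN_RAW_MEDIAN) / RAW_SD
    return SCALE_MEAN + SCALE_SD * z

for raw in (49, 76, 103):
    print(raw, "->", round(standard_score(raw), 1))
```

A raw score one standard deviation above the 14-year median thus
receives a standard score of 116, whatever the age of the pupil who
earns it; it is the age norm, not the scale, that changes from one age
group to another.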

The procedure for finding an individual’s standard-score rating on
this scale amounts to this: the raw score is obtained; each raw score
is found in a table which gives its standard score equivalent; each
standard score is found in another table giving its equivalent mental
age. The IQ, however, is not found by means of the usual formula
(IQ = MA/CA). In fact, the mental age is not used in calculating IQ
for this scale. Instead, the authors provide a table for finding what is
called the “deviation IQ,” so called because “Basically the procedure
for computing deviation IQ’s requires that the difference be found
between the [individual’s] obtained standard score and the average
standard score for other individuals of the same age. This difference
or deviation is then interpreted directly in terms of IQ [from a table].
This can be done because both IQ’s and the normalized standard
scores are distributed normally.” ⁹ In other words, Terman and
McNemar assume that raw scores and IQ’s are both normally distributed.
When this assumption is made, it is possible, through knowledge of the
characteristics of the two curves, to transmute one set of scores
directly into the other.

We have presented this scale’s method of deriving IQ’s in some
detail because it is essential that the student of psychological testing be
aware of the several techniques and of the fact that indexes called by
the same name are not always derived in the same manner. First, we
have the original and most common method of finding IQ: mental age
divided by chronological age. Second, there is the method used with
the Bellevue scale. And third, there is the type of deviation IQ
described above.
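
The difference between the first and third of these methods may be shown
by a brief sketch. The ratio IQ is written with the conventional
multiplier of 100. For the deviation IQ we assume, for illustration
only, that a deviation in standard-score points is read one-for-one as
IQ points, since both scales have a standard deviation of 16; the
published table may not be exactly of this form. The pupil's scores are
hypothetical; the age norm of 93 for 13 years, 0 months is the value
given above. The Bellevue method is not sketched.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100.0 * mental_age_months / chronological_age_months

def deviation_iq(standard_score: float, age_norm: float) -> float:
    """Deviation IQ under the assumed one-for-one reading of the deviation."""
    return 100.0 + (standard_score - age_norm)

# Hypothetical pupil aged 13 years, 0 months (156 months).
print(round(ratio_iq(168, 156), 1))          # mental age of 14-0
print(deviation_iq(105.0, age_norm=93.0))    # standard score of 105
```

The two indexes bear the same name but rest upon different operations:
the first upon a quotient of ages, the second upon the pupil's distance
from the average of his own age group.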


TESTS OF PRIMARY MENTAL ABILITIES ¹⁰


Contents. The Thurstones have published a series of group
tests variously known as The Chicago Tests of Primary Mental Abilities
and The SRA Primary Mental Abilities. The latter are more recent
scales, shorter and less satisfactorily standardized and reported than
the original Chicago PMA tests, and on the whole inferior to the
earlier versions. The following description, therefore, will be limited
to the Chicago PMA tests (1943), devised for ages 11-17. This
particular scale will serve our purpose, since all of the scales in the
Thurstones’ series are based upon the same psychological and
statistical principles, even though all are not of equal merit.

9 Manual of Directions, p. 9.
10 By L. L. and T. G. Thurstone. Published by Science Research Associates;
several forms, 1938-1950.

The PMA scale for ages 11 to 17 is constructed upon the group- 
factor theory of mental ability; that is, upon the theory that intelligence 
consists of the operations of certain distinguishable and relatively
independent mental functions. (See Chapter 3.) The “primary abilities”
to be tested by means of this scale are those which L. L. Thurstone
and his collaborators report as having been isolated by factorial
analysis. The “primary factors” measured are six: number facility,
verbal comprehension, spatial perception, word fluency (extent of
word associations as distinguished from verbal comprehension),
reasoning, and rote memory. Each of these is measured as indicated
below:


Number facility, by tests of addition and multiplication. 

Verbal meaning, by one test of vocabulary (the familiar multiple- 
choice form) and one of supplying words to fit given definitions. In 
each item of the latter, five letters are provided, one of which is the 
first letter of the correct word. 

Spatial perception, by tests in which designs and geometric figures, 
differently rotated, are to be identified as being the same as or differ- 
ent from a given design or figure. 

Word fluency, by two tests—one requiring that, within a time 
limit, as many words as possible be written, beginning with a given 
letter; the other requires that as many four-letter words as possible, 
beginning with another letter, be written within a time limit. 

Reasoning, by two tests involving, in one, perception of the pat- 
terns within series of letters of the alphabet, and, in the other, per- 
ception of the patterns within letter groupings. 

Rote memory, a names test. There are twenty cards, on each of
which are a first and last name. The cards are exposed consecutively,
each for fifteen seconds. The subjects are then required to pair off
each last name with its correct first name (chosen from seven names
in multiple-choice form).


Validity and Reliability. By means of the split-half method,
reliability coefficients were separately calculated for five of the six
subtests for grades 6, 8, 10, and 12, with approximately 200 subjects at
each half-grade level. No reliability coefficients are provided for “word
fluency,” since the scores on this subtest do not lend themselves to the
split-half method. Instead, it would be necessary to use the re-test


COMPLETION

Read the definition below. Think of the word which fits the definition. The first letter of the
word is in the row of letters under the definition.

The first meal of the day.

[Row of answer letters, with B marked]

The word is “Breakfast.” “B” is marked because it is the first letter of the word “Breakfast.”

Do the following example:

A place or building for athletic exercises.

[Row of answer letters]

FIGURES

Look at the row of figures below. The first figure is like the letter F which is right side up.
All the other figures are like the first but they have been turned in different directions.

[Row of F figures turned in different directions]

Satisfy yourself that all of these figures look like the first one if they are turned right
side up.

Now look at the next row of figures. The first one looks like an F. But none of the other
figures would look like an F even if they were turned right side up. They are all made
backward.

[Row of F figures made backward]

Some of the figures in the next row are like the first figure. Some are made backward.
The figures like the first figure are marked.

[Row of figures lettered A to F, some marked]

Fig. 11.1. Items from Chicago Tests of Primary Mental Abilities. Science
Research Associates. (By permission.)


CARDS

Here is a picture of a card. It looks like an L, and it has a hole in one end.

[Figure: an L-shaped card]

The two cards below are alike. You can slide one around on the page to fit the other.

[Figure: two cards that are alike]

Now look at the next two cards. They are different. You cannot make them fit exactly by
sliding them around on the page.

[Figure: two cards that are different]

Here are more cards. Some of the cards are marked. The cards which are like the first
card in this row are marked.

[Figure: a row of cards, some marked]

Fig. 11.2. Item from Chicago Tests of Primary Mental Abilities. Science
Research Associates. (By permission.)


method; but this had not yet been done at the time the scale and its
manual were published (1943). With the exception of rote-memory
reliabilities, the coefficients reported are all between .95 and .98
(corrected by the Spearman-Brown formula). These are very high. The
memory tests, however, showed coefficients of low reliability, ranging
from .63 to .82. Seven of these coefficients were in the .60’s, two in the
.70’s, and three in the .80’s.

Validity of the Chicago scale is reported in a manner quite
different from those which are used with most other scales thus far
discussed. Having accepted the group-factor theory, and having analyzed
mental abilities into the six “relatively independent” factors already
specified, the authors proceeded on the hypothesis that the scale would
be valid if the correlations between the primary abilities were
relatively low; for such correlations would show relative independence of
factors, as required by the theory, and would thus satisfy the
underlying group-factor theory of mental abilities. In other words, the
adopted theory of intelligence is used as the criterion of validity
rather than the more usual and generally accepted evidences of mental
development and intelligent activities. Or, to state the same thing
otherwise, the criteria of the scale’s validity are internal rather than
external. The intercorrelations thus found range from .13 (spatial
perception with rote memory) to .58 (reasoning with verbal
comprehension). About three fourths of the coefficients are above .30;
and about half are above .40.
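
The split-half coefficients cited for this scale were corrected by the Spearman-Brown formula, which estimates the reliability of the full-length test from the correlation between its two halves: r(full) = 2r(half) / (1 + r(half)). The following is a minimal sketch of that computation, not the procedure of the manual itself; the function names and the pupils' scores are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_brown(r_half, factor=2.0):
    """Step up a part-test correlation to a test `factor` times as long."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)

def split_half_reliability(odd_scores, even_scores):
    """Correlate the two half-test scores, then apply the correction."""
    return spearman_brown(pearson_r(odd_scores, even_scores))

# Hypothetical scores of eight pupils on the odd and the even items.
odd_items = [12, 15, 9, 20, 17, 11, 14, 18]
even_items = [11, 16, 10, 19, 15, 12, 13, 19]

print(round(spearman_brown(0.90), 3))  # 0.947
print(round(split_half_reliability(odd_items, even_items), 3))
```

A half-test correlation of .90 is thus stepped up to about .95 for the whole test, which shows why corrected coefficients run higher than uncorrected ones.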


LETTER GROUPING

Look at the groups of letters below.

AABC     ACAD     ACFH     AACG

Three of the groups have two A's. The group which does not have two A's is marked.

Here is another problem. Three of the groups are alike in some way. Can you find three
groups which are alike? Mark the one that is different.

XURM     ABCD     MNOP     EFGH

In three of the groups the letters are arranged in alphabetical order. The first group is not
in alphabetical order. You should have marked it to show that it is different.

LETTER SERIES

Study the series of letters below. What letter should come next?

abababab          abcdef

The next letter in this series should be a. The letter a has been marked in the answer row
at the right.

Now study the next series of letters and decide what the next letter should be. Mark the
letter in the answer row at the right.

cadaeafa          acdefg

You should have marked the letter g.

Fig. 11.3. Items from Chicago Tests of Primary Mental Abilities. Science
Research Associates. (By permission.)


An individual’s raw scores obtained on this scale, for each of the
“primary abilities,” are converted into percentile ranks which are then
plotted on a profile. In the first versions of the tests, no mental age,
intelligence quotient, or other general index was obtained, “Since the
principal purpose of the present test is to obtain a profile of the six
primary mental abilities for each child . . .”¹¹ It was the view of the
authors of this scale that representation of mental abilities by means
of a profile is not only consistent with their theory of mental
organization [...] but that a profile is most valuable for purposes
of [...] the individual’s performance and for educational
and vocational guidance.


[Profile chart: percentile rank for age (vertical axis, 10 to 100) plotted for each of the abilities N, V, S, W, R, and M]

Fig. 11.4. An Individual Profile Chart, Chicago Tests of Primary
Mental Abilities. Manual of Instructions, Chicago Tests of Primary
Mental Abilities. Science Research Associates. (By permission.)


The data thus far available indicate, however, that taken as a whole,
as a composite, the tests of “primary mental abilities” are in some
instances moderately and in others poorly correlated with scholastic
performance at the high-school and college levels. The data show
also that the “primary mental abilities,” when considered separately
or in “patterns,” do not differentiate significantly between persons
interested in or engaged in the study of various professions
(engineering, science, linguistic studies, medicine, journalism, etc.).¹²
Judgment must be suspended, therefore, regarding the value of these tests
in specific rather than general problems in educational and vocational
guidance.

11 Manual of Instructions, p. 19. In later versions of the scales, the authors
provide mental age units and IQ equivalents. However, they still regard
separate percentile ratings as superior to MA and IQ.


KUHLMANN-ANDERSON TESTS (6TH EDITION)¹³


Description. The first edition of this series of tests was pub- 
lished in 1927. Since that time they have undergone extensive ex- 
perimentation, re-standardization, and improvement so that they are 
at present among the superior group scales. They are superior in 
respect to standardization procedure, both intensive and extensive 
analysis of results, and availability of statistical evidence presented 
in the manual. 

The entire series of scales, of which there are nine, graded accord- 
ing to school level, includes thirty-nine subtests. Each subtest as a 
whole is placed according to its over-all relative difficulty in the age 
range; and the items within each subtest are placed in order of their 
difficulty. Since intelligence levels vary considerably within any single 
age group, and since there is overlapping among different age groups, 
there is also overlapping (duplication) of subtests from one scale to 
the next. Thus, the scales for the several adjacent levels include the
subtests as indicated below:


kindergarten, subtests 1-10
grade 1, subtests 4-13
grade 2, subtests 8-17
grade 3, subtests 12-21
grade 4, subtests 15-24
grade 5, subtests 19-28
grade 6, subtests 22-31
grades 7-8, subtests 25-34
grades 9-12, subtests 30-39


The subtest types are not unique, since they include the familiar
nonverbal and verbal materials such as sequences, classification,
number concepts, form perception and synthesis, word knowledge and
facility, scrambled sentences, arithmetical reasoning, verbal analysis,
etc. In the earlier levels, the materials are nonverbal; but they develop
gradually into scales that are largely verbal and numerical and, finally,
that are entirely so.

12 See Review of Educational Research, Vol. 14, No. 1, 1944, Chapter 3;
Vol. 17, No. 1, 1947, Chapter 2; Vol. 23, No. 1, 1953, Chapter 2.

13 By F. Kuhlmann and R. G. Anderson. Published by Personnel Press, Inc.,
Baltimore, 1952.

Thus, to a considerable extent there is, from age to age, over- 
lapping of the functions being measured by this series of subtests. 
This is quite consistent with the stated purpose of the scale: namely, 
to measure the levels of general mental development needed to suc- 
ceed in school work. In this connection, it should be noted also that 
the authors of the scale do not recommend the use of separate subtest 
scores as measures of separate psychological functions or for guidance 
purposes. But, quite appropriately, they do suggest that significant 
inconsistencies among scores on separate subtests furnish evidence 
of erratic performance, for which causes and explanations should be 
sought. Implicit in this position of Kuhlmann and Anderson is the 
acceptance of the general factor theory of intelligence. This
acceptance is later made explicit by them in their validation, as will
be indicated below.


Validity and Reliability. The usual criteria of validity have been
used: test performance of retarded, average, and accelerated pupils;
intercorrelations of subtests; correlations of subtests with total scores;
means, standard deviations, and ranges of IQ’s at the several grade
levels; correlations with quality of school work; and power to
discriminate among successive age groups (age differences being
fractions of a year), upon which heavy emphasis is laid.

These criteria are, for the most part, reasonably well satisfied: the
test differences between the known groups of pupils are significant;
the mean IQ’s approximate what would be expected; for the various
age groups, the standard deviations of IQ’s range between 9.5 (age 7)
and 16.1 (age 14), but more than half of the S.D.’s are between 13
and 16 points; a table of age norms for each of the thirty-nine
subtests shows considerable discriminative ability between age
increments of less than a year.


Two-thirds of the subtest intercorrelations fall in the .40’s and
.50’s; while of the coefficients for each of the subtests correlated with
total score, more than two-thirds fell between .50 and .81. These
coefficients indicate that there is marked communality of functions
being measured by most of the subtests but that at the same time,
also, some functions beyond the general one are being measured.¹⁴

Reported correlations with school achievement are relatively high,
being from approximately .60 to about .80.
Reliability was studied by means of the several familiar methods:

Odd-even reliabilities (uncorrected) varied from .88 to .95.

Standard errors of measurement, after several retestings, varied
from 5.5 to 6.5 points in IQ.

The test-retest “index of reliability” is .90 for the data from which
the standard errors of estimate were obtained.¹⁵
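
The relation between the “index of reliability” and the “coefficient of reliability” is a simple one: the index is the square root of the coefficient, so that an index of .90 corresponds to a coefficient of .81. The sketch below illustrates this, together with the customary formula for the standard error of measurement, S.D. × √(1 − r). The function names are hypothetical, and the standard deviation of 15 IQ points is an assumed value chosen only for illustration; it is not a figure reported in the manual.

```python
import math

def index_of_reliability(coefficient):
    """Correlation of obtained with estimated 'true' scores."""
    return math.sqrt(coefficient)

def coefficient_from_index(index):
    """Coefficient of reliability implied by a given index."""
    return index ** 2

def standard_error_of_measurement(sd, coefficient):
    """Customary formula: SD times the square root of (1 - r)."""
    return sd * math.sqrt(1.0 - coefficient)

print(round(coefficient_from_index(0.90), 2))  # 0.81
print(round(index_of_reliability(0.81), 2))    # 0.9

# Assumed SD of 15 IQ points (illustrative only).
print(round(standard_error_of_measurement(15.0, 0.81), 1))  # 6.5
```

Under these assumed values the standard error comes to about 6.5 points, which is of the same order as the figures reported for the scale.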


Since there are time limits for these subtests, the appropriateness 
of the odd-even method of determining reliability might be questioned. 
The authors have shown, however, that odd-even reliability is just 
about as high when all time limits are removed as when the tests are 
timed, the range of coefficients being from .80 to .95. These results 
indicate that the scales are essentially tests of “power” rather than
“speed.”


Scoring. The Kuhlmann-Anderson scales use the median mental age
method, employed with the Pintner-Paterson Performance Scale
(1917). The procedure is this: a mental age is yielded by each of
the ten subtests in a given scale; the median of the ten values is taken
as the over-all mental age. When this method is used the principle
being applied is that this median value is most representative of an
individual's general level of performance, especially so since it is not
affected by a few extremely high or extremely low subtest scores, if
such occur (whereas the arithmetic mean is affected by such
extremes). From mental ages, thus found, and chronological ages,
intelligence quotients are determined in the usual manner.¹⁶

14 Apparently not much significance is attached to correlation with Stanford-
Binet results as an indication of validity. No data are presented in the manual
on this criterion, although there is brief mention of close correspondence
between results obtained with both scales.

15 The “index of reliability” is to be distinguished from the “coefficient of
reliability.” The latter is derived from the correlation of two sets of obtained
scores; while the former is the correlation between one set of obtained scores
and a set of estimated “true” scores for the same population sample. The
“index of reliability” is always the higher of the two, since it indicates the
probable higher limit of correlation rather than the correlation that has
actually been found for the two sets of obtained scores. An “index” of .90 (as
reported above) corresponds to a “coefficient” of .81.
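
The median mental age method may be illustrated by a short sketch. It assumes that the “usual manner” of finding the intelligence quotient is the ratio 100 × MA / CA, and it uses ten hypothetical subtest mental ages (in months), one of which is extremely high; none of these values comes from the manual.

```python
from statistics import mean, median

def median_mental_age(subtest_mental_ages):
    """Over-all mental age taken as the median of the subtest values."""
    return median(subtest_mental_ages)

def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: 100 times mental age over chronological age."""
    return 100.0 * mental_age / chronological_age

# Hypothetical mental ages in months from ten subtests; 168 is an extreme.
subtest_ages = [118, 120, 121, 122, 123, 124, 125, 126, 127, 168]

ma_median = median_mental_age(subtest_ages)   # 123.5
ma_mean = mean(subtest_ages)                  # 127.4, pulled up by 168

print(ma_median, ma_mean)
print(round(ratio_iq(ma_median, 120), 1))     # 102.9 for a child of 120 months
```

The single extreme value raises the arithmetic mean by nearly four months but leaves the median untouched, which is the property on which the method rests.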


GROUP SCALES FOR COLLEGE FRESHMEN 


Some tests of intelligence have been constructed for the spe- 
cific purpose of appraising abilities of individuals with special refer- 
ence to the intellectual demands of college curricula, generally of 
the kind found in colleges of liberal arts and similar institutions, such 
as teachers colleges. We shall briefly describe several of these, prin- 
cipally to familiarize the reader with the types of materials included. 
These tests do not present any new or unusual principles in respect 
to construction, organization, or interpretation. 


American Council on Education: Psychological Examination for
College Freshmen.¹⁷ This scale includes the following familiar subtests:
arithmetic problems, word definition, figure analogies, same-opposite
(word meaning), number series, and verbal analogies. These six tests
are grouped into two classes: (1) linguistic tests (same-opposite,
word definition, and verbal analogies) which yield an L-score; (2)
quantitative tests (the remaining three) which yield a Q-score.


The authors state that these two separate scores may be used in
educational counseling; for the linguistic tests have been found to
have higher correlations with scholarship in colleges of liberal arts
than do the quantitative scores. This result is due to the fact that a
very large portion of courses in these colleges make their principal
demands upon linguistic activities and thinking. On the other hand, for
scientific and technical curricula, the authors hold, the quantitative
tests may be more significant.


The raw scores are converted into separate percentile ranks for
total score, L-score, and Q-score. Separate distribution tables are 
provided for men and women. Also, it is not uncommon for a college 
to prepare its own frequency tables and to calculate percentile ranks 
of its students on the basis of their performance alone, rather than 
on the basis of national norms. 
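
A college that prepares its own norms may compute percentile ranks directly from the scores of its own students. The sketch below follows one common convention, in which half of the cases tied at a score are counted as falling below it; the function name and the list of local scores are hypothetical.

```python
def percentile_rank(score, local_scores):
    """Percent of local cases below `score`, counting half of the ties."""
    below = sum(1 for s in local_scores if s < score)
    tied = sum(1 for s in local_scores if s == score)
    return 100.0 * (below + 0.5 * tied) / len(local_scores)

# Hypothetical total scores of ten freshmen at one college.
local_scores = [88, 95, 101, 105, 105, 110, 118, 124, 131, 140]

print(percentile_rank(105, local_scores))  # 40.0
print(percentile_rank(140, local_scores))  # 95.0
```

The same raw score may, of course, receive a different percentile rank on national norms, since the rank depends entirely upon the group with which the individual is compared.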


16 The Kuhlmann-Finch Tests (1952) are modeled after the Kuhlmann-
Anderson Tests, in respect to principles and content. Published by Educational 
Test Bureau. 

17 Developed by L. L. Thurstone and T. G. Thurstone. Published by the Edu- 
cational Testing Service; annual editions. A separate edition for high-school 
students is also available. 


ITEMS FROM
AMERICAN COUNCIL ON EDUCATION, PSYCHOLOGICAL
EXAMINATION FOR COLLEGE FRESHMEN (By permission)

Figure Analogies:

In Sample 3 below, the rule has two parts: “Make Figure B of the
opposite color and larger than Figure A.” Apply the rule to Figure C
and blacken the space which corresponds to the correct answer.

[Sample 3 figures not reproduced]

You should have blackened the space numbered 1, which corresponds
to the large white square.

Verbal Analogies:

In each row of words, the first two words form a pair. The third
word can be combined with another word to form a similar pair.
Select the word which completes the second pair. On the answer
sheet, blacken the space which corresponds to the word you select.

sky-blue    grass-    (1) ground (2) sod (3) path (4) blue (5) green

ice-solid   water-    (1) hard (2) fire (3) iron (4) liquid (5) boat

Number Series:

Find the rule in the series below, and blacken the space on the
answer sheet which corresponds to the next number.

[Series not legible]
(a) [...]   (b) 10   (c) 11   (d) 12   (e) 13

The series above goes by alternate steps of subtracting 2 and adding 3.
You should have blackened space (e), which corresponds to
13, the next number.


Ohio State University Psychological Test.¹⁸ This scale includes three
subtests: same-opposite (word-meaning), word analogy, and paragraph
meaning (answering questions on each of a number of paragraphs,
to test the subject's comprehension and interpretation of
materials read). The scale is clearly and specifically devised to test
verbal intelligence.


18 Prepared under the direction of Herbert A. Toops. Published by Ohio 
College Association Committee on Intelligence Tests for Entrance, Ohio State 
University; Form 24, 1950, is most recent. This scale has been used with high- 
school students also, for whom separate norms have been prepared. 


Norms of percentile ranks are provided, based upon scores of
freshmen in a large number of Ohio colleges. Separate tables of norms
for total scores are given for men and women. Separate norms are
also provided for the test of paragraph meaning. The reason,
presumably, for the separate norms for paragraph meaning is that thereby
a subject’s level of performance in respect only to word knowledge
may be distinguished from his ability to interpret and think in terms
of linguistic symbols.

As in the case of the American Council scale, it is not an uncommon
practice for colleges to prepare their own separate norms and
percentile ranks when using the Ohio State scale.


College Entrance Examination Board: Scholastic Aptitude Test.¹⁹
This scale has two sections: a verbal and a quantitative
(mathematical). The former section includes tests of word-opposites, word
analogies, paragraph meaning, and sentence completion (rather long
and complex ones); the latter section includes mathematical problems,
involving arithmetic, algebra, and plane geometry.

The purpose of these tests is not to examine an individual’s
knowledge or mastery of subject-matter covered in high school, even though
the test items do utilize the tools of learning and thinking acquired
there. The verbal section is designed to measure understanding of
words, skill in dealing with word and thought relationships, and
ability to read with understanding and discrimination. The
mathematics section is designed to measure ability to handle quantitative
concepts, rather than achievement in the field of mathematics. On the
whole, scores on the Scholastic Aptitude Test are considered to be
indexes of probable success in subsequent ac[...] courses involving
verbal and quantitative m[...] the verbal sections, obviously, [...]
value regarding performance in [...] social studies; while the
mathematics section has greater predictive value in the study of
physical sciences and engineering. It appears that while small
differences between [...]ions, in the case of a given person, have
little signi[...] [im]portance in counseling students in regard to their
selection of college curricula.

Raw scores on this scale may be converted into percentile ranks,
according to national norms. Or, again, each individual institution
may set up its own frequency distribution of scores and corresponding
percentile ranks.

19 Prepared by the staff of the College Entrance Examination Board; annual
editions; not available for use by others. Although this is called a scholastic
aptitude test, it has much the same content as other tests of intelligence at this
level and is intended to serve the same purpose.


ITEMS FROM COLLEGE ENTRANCE EXAMINATION BOARD
SCHOLASTIC APTITUDE TEST (By permission)

Sentence completion:

One of the most prevalent erroneous contentions is that Argentina
is a country of .......... agricultural resources, and needs only
the arrival of ambitious settlers.

(1) modernized (2) flourishing
(3) undeveloped (4) waning
(5) limited

Precision of wording is necessary in good writing; by choosing
words that exactly convey the desired meaning, one can avoid ..........

(1) duplicity (2) incongruity
(3) complexity (4) ambiguity
(5) implications

Yale Educational Aptitude Battery.²⁰ This series of tests is made up
of seven parts: (1) verbal facility (the verbal section of the College
Board Scholastic Aptitude Test); (2) linguistic aptitude (as measured
by a test of artificial language); (3) verbal reasoning (logical
inference and deductive judgment); (4) quantitative reasoning (“ability
in manipulating hypothetical quantitative data so as to perceive
relations or principles characterizing them and derive ‘laws’ analogous
to, yet different from, those actually encountered in the study of the
natural sciences”); (5) mathematical aptitude (mathematical problems
similar to those in the College Board test); (6) spatial visualizing
(“representation of three-dimensional forms by two-dimensional
figures through projections and block-counting”); (7) mechanical
ingenuity (“problems in gear or pulley movements, structural
stability, and mechanical operations”).

20 A. B. Crawford and P. S. Burnham, Forecasting College Achievement
(New Haven: Yale University Press, 1946). The reader will note that the
authors of these tests have chosen to call them an educational aptitude battery.
The contents of these scales, however, are very much the same as or identical
with the contents of other scales variously designated as tests of intelligence or
of mental abilities, or as psychological examinations. Because the Yale battery
is similar to these others not only in content but in purpose as well, it is
included here rather than in the chapter on aptitude tests.


The reader is already familiar with these types of items from sam- 
ples of other scales, except numbers 2, 4, and 7. The last of these 
is generally found in scales designed to test only specialized “me- 
chanical aptitudes,” so it will not be illustrated at this point. (See 
Chapter 12.) The following items illustrate tests 2 and 4, respectively. 
ARTIFICIAL LANGUAGE 
Vocabulary: 
I—vlu 
he, it (nom. )—wes 
to be—jahviz 
to read—skraliz 
to have—dromiz 
good—zeyt 
book—stetsleit 
word—gleit 

Rules:

1. Articles are not used in the artificial language.
2. Verbs are not conjugated for person and number.
3. Future: prefix bli to the verb.

Sample (to be translated):

I have a book                                              Answers
A. (1) wes, (2) polvlu, (3) vlu, (4) polwes, (5) vlul      1 2 3 4 5
B. (1) dromiz, (2) jahviz, (3) amdiz, (4) somiz,
   (5) binotiz                                             1 2 3 4 5
C. (1) gleit, (2) zepoldeit, (3) zeyt, (4) stetsleit,
   (5) oveit                                               1 2 3 4 5

Quantitative Reasoning

  A      B
  8      2
 32      4
 18      3
200     10
 50      5

The problem is first to discover and state the relationship
between the paired numbers. In this illustration it is: A = 2(B²). Then:
when A = 72, what is the value of B?
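
The reasoning required by this item can be traced in a few lines. The sketch below checks the stated rule, A = 2(B²), against the paired numbers of the illustration and then solves for B when A = 72. The pairs are as read from the sample item; the fourth B value is taken to be 10, an assumption consistent with the stated rule, and the function names are hypothetical.

```python
import math

# (A, B) pairs from the illustration; the fourth B is assumed to be 10.
pairs = [(8, 2), (32, 4), (18, 3), (200, 10), (50, 5)]

def rule(b):
    """The relationship stated in the illustration: A = 2 * B squared."""
    return 2 * b ** 2

def solve_for_b(a):
    """Invert the rule for the positive value of B."""
    return math.sqrt(a / 2)

# Every pair in the table satisfies the rule.
print(all(rule(b) == a for a, b in pairs))  # True

# When A = 72, B squared is 36, so B is 6.
print(solve_for_b(72))  # 6.0
```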




The Yale battery differs from the other scales for college freshmen 
principally in the following respects: it contains a more diversified 
and a larger number of types of items; it attempts to predict college 
success in more specialized areas of study (verbal, scientific, engi- 
neering); and it provides the tables, in terms of points and percentile- 
rank equivalents, for obtaining a profile graph for each person tested. 
On the whole, this battery is intended to reveal individual differences 
in major areas and in educational promise at the higher levels which, 
most persons would grant, depends to a very large degree upon those 
mental functions that have been designated as constituting intelligence. 


Evaluation of Group Scales for College Freshmen. The reliability 
coefficients of the four scales discussed above are of the general order 
of .90. The method usually employed is the familiar one of correlating 
scores on odd-numbered items with those on the even-numbered ones. 

The intercorrelations presented in Table 41 are representative of 
the subtests used at the level of college freshmen. 

Since scales for college freshmen are designed for a specific pur- 
pose—selection of promising students and prediction of college 
achievement—their validity depends upon the success with which 
they perform this task. The scores on these scales, therefore, have 
been correlated with marks obtained in college courses.²² The
medians of the coefficients found in recent years are generally in the
neighborhood of .50 and .60.

Test scores have been used to predict survival in college,
disregarding actual marks. It was found that the scales are useful in
helping to identify individuals at the two extremes: namely, those who
are most likely to complete their undergraduate work and those who
are least likely to do so. The test scores do not differentiate very well
between individuals in the middle range of the distribution
(approximately the mid-sixty percent), so far as scholastic achievement is
concerned. The reasons for this are that this middle group is fairly
homogeneous, that numerous nonintellectual elements affect students’
course marks, and that college marks themselves are neither objective
nor entirely reliable. The results of correlational and the “survival”
studies are not such as to warrant exclusive use of psychological tests
in selection and elimination of candidates for admission to college.
But they do make a valuable contribution when used in conjunction
[...] as high-school marks, performance [...] and ratings by teachers.

22 See Review of Educational Research, Vol. 14, No. 1, 1944, Chapter III;
and the first numbers in 1950 [...] and 1953; also Crawford and Burnham, passim.

TABLE 41

ACE Psychological Examination:
College Edition, 1948²¹
(385 Cases at One College)
Intercorrelations, Means, and Standard Deviations of the Six
Subtests, Q, L, and Total Score

                        (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)
(1) Arithmetic          ....   .432   .558   .427   .297   .450   .739   .439   .643
(2) Figure Analogies    .432   ....   .488   .349   .285   .529   .836   .442   .690
(3) Number Series       .558   .488   ....   .323   .283   .498   .847   .421   .682
(4) Completion          .427   .349   .323   ....   .658   .574   .436   .829   .757
(5) Same-Opposite       .297   .285   .283   .658   ....   .563   .351   .904   .766
(6) Verbal Analogies    .450   .529   .498   .574   .563   ....   .611   .822   .836
(7) Q-Score             .739   .836   .847   .436   .351   .611   ....   .530   .827
(8) L-Score             .439   .442   .421   .829   .904   .822   .530   ....   [...]
(9) Total Score         .643   .690   .682   .757   .766   .836   .827   [...]  ....

Mean                    8.16  15.96  16.58  16.53  21.16  26.86  40.70  64.56 105.25
Standard deviation      2.91   5.34   4.85   4.14   7.16   5.62  10.72  14.94  22.53
Percentile rank of means                                            44     50     47

Results for all colleges as reported in Norms Bulletin for 1948, College Edition
Mean                                                             41.56  64.38 105.91
Standard deviation                                               11.39  16.12  24.65


ARMY GENERAL CLASSIFICATION TEST²³

This scale, in various earlier versions, was used in the Army
in World War II to classify men and women according to their abilities
[...] learning in modern military [...] emphasis was placed upon verbal
comprehension, quantitative reasoning, and spatial perception. The
AGCT, therefore, adapted and utilized conceptions of testing and
types of materials that had been developed and in use for a long time
prior to the war. Three types of test items were employed to measure
the three processes, respectively: vocabulary, arithmetical problems,
and block counting. The items are presented in spiral form: a group
of vocabulary items, then a group of arithmetical problems, then
blocks. The sequence is repeated a number of times, each group of
items being of greater difficulty than the preceding one. They are of
the usual multiple-choice type.

21 Communication from Educational Testing Service. (By permission.)

23 Published by Science Research Associates, Chicago, 1947. See Staff,
Personnel Research Section, Classification and Replacement Branch, Adjutant
General’s Office, Personnel Classification Tests, War Dept. Technical Manual, TM
12-260 rev., Washington: U. S. Govt. Printing Office [...].


Scoring. The raw scores might range from zero to 150. These are
converted into a standard score, so arranged that the mean is 100 and
the standard deviation is 20. Tables are also provided for conversion
of raw scores into percentile scores.
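
A standard-score scale with a mean of 100 and a standard deviation of 20 can be illustrated by a simple linear conversion. This is only a sketch: the Army's published conversion tables may not have been strictly linear, and the raw-score mean of 75 and standard deviation of 25 used below are assumed values chosen for illustration, not figures from the manual.

```python
def standard_score(raw, raw_mean, raw_sd, target_mean=100.0, target_sd=20.0):
    """Linear conversion of a raw score to a scale with the given mean and SD."""
    z = (raw - raw_mean) / raw_sd          # deviation in raw-score SD units
    return target_mean + target_sd * z

# Assumed raw-score distribution (illustrative only).
RAW_MEAN, RAW_SD = 75.0, 25.0

print(standard_score(75, RAW_MEAN, RAW_SD))   # 100.0, an average raw score
print(standard_score(100, RAW_MEAN, RAW_SD))  # 120.0, one SD above the mean
print(standard_score(50, RAW_MEAN, RAW_SD))   # 80.0, one SD below the mean
```

On such a scale a standard score states directly how far, in units of twenty points to the standard deviation, an individual stands above or below the average of the group.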
TABLE 42

Distribution of AGCT Standard Scores²⁴
(N = 160,000)

              Percent
Score         of Sample
41-59             4
60-89            23
90-109           31
110-129          33
130-161           9


Validity and Reliability. Using the odd-even method, the mean of
the reliability coefficients (corrected) was approximately .95. (N
varied from 639 to 3856.) When the test-retest method was used,
the reported reliability was .82.

The AGCT scores correlated .73 with number of school grades
completed. It correlated as follows with other tests: Army Alpha,
.90; Otis Higher Mental Ability Examination, .83; American Council
on Education Psychological Examination, .79. That the AGCT
scores are not in part a function of chronological age is shown by a
correlation of .02 with CA (N = 4330).

Test scores were correlated with training-school marks in several
hundred training programs. The coefficients that follow are based
upon groups of trainees who had been preselected for a particular
school on the basis of education and civilian occupation. The
coefficients are lower, therefore, than they would have been if an entirely
unselected group of trainees had been used. Correlations were:
clerical trainees, .40; airplane mechanic trainees, .35; sheet metal
trainees, .27; radio operator-mechanic trainees, .32; officer candidates
(various services), .40.

24 From Examiner’s Manual, AGCT, p. 3. Science Research Associates. (By
permission.)


The ranges of scores, classified according to civilian occupation of
the subjects, show such extensive overlapping of groups (interquartile
range, also 10th and 90th percentile range) that occupational
differentiation would be extremely doubtful, except in instances where
occupations being compared are very widely separated on the scale.

The army versions of this test we[...] under pressure of an emergency
situati[...] satisfactorily. But adaptation of the [...] a very doubtful
procedure; for superior si[...] tests are available. These scales and
batteries, which often require much time and careful interpretation,
are to be preferred in civilian situations that are free from pressures
for speed and in which extravagance with human and material
resources is discouraged.
MILLER ANALOGIES TEST²⁵


Originally devised in 1926, this test has [...] measure scholastic
aptitude at the gr[aduate ...] of one hundred items, with [...]s; but
the speed factor is said to be of [...]. [The te]st includes analogies
covering a wide var[iety of ...] specialization. Although some
quantit[ative m]aterials are used, the items are predomin[antly ...].
Also, the relationships to be discerned w[...]. (The test is not
circulated and its use is controlled.)

Items were retained for their di[...] by an item analysis of results
ob[...] These items were then administered to 770 entering graduate
students at the University of Minnesota. The latest revision was made
on the basis of an item analysis of their scores, the items being
arranged in their order of difficulty as determined by percentage of
errors. These students represented a wide range of major fields of
study.

²⁵ Published by The Psychological Corporation, 1947. A newer form (H),
1950, is also available. The 1947 and 1950 forms are practically
equivalent.


Validity and Reliability. Coefficients of reliability, found by the odd- 
even score method, for three groups of graduate students, were .93, 
.93, and .92, as corrected by the Spearman-Brown formula. The 
numbers were 100, 162, and 125. 
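
The odd-even method with the Spearman-Brown correction can be sketched as follows. This is a minimal illustration, not the computation used in the manual: the function names are hypothetical, and the half-test correlation of .87 is an assumed value chosen because it yields a corrected coefficient close to the .93 reported above.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown_full_length(r_half):
    """Step a half-test correlation up to the full-length reliability."""
    return 2.0 * r_half / (1.0 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of item scores (e.g. 0 or 1) per examinee."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown_full_length(pearson_r(odd_totals, even_totals))


# An ASSUMED half-test correlation of .87 corrects to about .93.
print(round(spearman_brown_full_length(0.87), 2))  # 0.93
```

The correction is needed because each half is only half as long as the test actually used; the uncorrected correlation between halves understates the reliability of the whole.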

Validity is determined largely by ability of the test to predict suc- 
cess in graduate study. The correlation coefficients between the Miller 
Analogies Test and numerical marks, as reported in the manual, are 
restricted to University of Minnesota graduate students in one field: 
namely, education. For course grades, the twenty-four coefficients 
ranged from .14 to .78. (With one exception, all were .35 or above.) 
The median was .54. Ten correlations with grades in final compre- 
hensive examinations ranged from .28 to .54, with a median of .39. 
Although correlation coefficients are not reported, the data show an
increase in mean test-score as honor point-ratios increase. Validation
studies have been made subsequently at other universities. The
correlations found at these institutions, in other fields of graduate
study, were of approximately the same magnitude.

When the analogies-test scores were correlated with average scores 
on seven parts of the Graduate Record Examination (mathematics, 
physics, chemistry, biology, history, government and economics,
literature and fine arts) the coefficients were much higher. These ranged
from .64 to .84, with a median of .775. Correlations with the ad- 
vanced Graduate Record Examination (fields of specialization) 
yielded coefficients of .81 and .79 for major students, respectively, 
in chemistry and languages-and-literature. Correlations with the 
verbal parts of the Graduate Record Examination were from .74 to 
.81, the median being .80.

This array of coefficients adds up to the conclusion that the Miller 
Analogies Test is a useful additional source of evidence in regard to 
a person’s ability to pursue graduate study. Combined with results 
of the Graduate Record Examinations, the value of each is increased. 
The lower coefficients obtained with actual course grades are very
probably due to the variations in marking, the differences in
difficulty of courses, specialization of interests and abilities, and
influence of nonintellective factors that affect a student’s level of
achievement. This analogies test is a carefully constructed instrument,
devised for use with a difficult problem: namely, differentiation among
individuals in a rather highly selected group. The ability tested by
means of these analogies is one factor in the prediction of success in
graduate studies: namely, ability to learn verbal and other abstract
concepts and course materials. What the analogies do not evaluate
(nor do they purport to do so) is original thinking and constructive
research ability, both of which should rank very high in graduate
studies.


OTHER GROUP SCALES 


It is our purpose, in this section, to refer briefly to several
scales which will help to give the reader a more nearly adequate
conception of representative group instruments.


Institute of Educational Research Intelligence Scale CAVD. This
test was developed at Columbia University under the direction of
E. L. Thorndike and frequently is designated by his name.²⁶ The scale
gets its name (CAVD) from the fact that it proposes to measure
intellect specifically by means of four kinds of mental tasks:
completion (C), ability to supply words so as to make a statement
“true and sensible”; arithmetical problems (A); vocabulary (V),
ability to understand single words; directions (D), ability to
understand connected discourse as in oral directions or paragraph
reading.

The distinguishing feature of this scale is that the items are
arranged in order of difficulty, providing seventeen different levels,
in each of which the tasks of any one subtest (e.g., C) are of nearly
equal difficulty. Also, the steps between levels are of approximately
equal difficulty. The lowest level is suitable for three-year-old
children while the highest levels are intended for superior adults.
Thus the scale is designed to test the same functions in a continuum
from early childhood through adulthood.

The range of difficulty in the same category of mental tasks may
be illustrated by the following examples:

²⁶ Published by Bureau of Publications, Teachers College, Columbia
University, 1925. Norms are not available for levels below ninth grade
population.



Completion, lowest level:
You are sitting on a ..........

Completion, one of the highest levels:
Throughout the river plains of northern India, two harvests, and,
some provinces, .......... [...] each

Arithmetic, lowest level:
Counts two pennies

Arithmetic, one of the highest levels:
A factory earns $70 a day for its owner when it is working full
capacity and $15 a day when it is working half capacity. In how
many days will it earn $1,000 if two days out of every three are
only half capacity?

Vocabulary, lowest level:
“Show me the horse” (to be indicated in a series of pictures).

Vocabulary, one of the highest levels:
Accolade [means] (1) salutation, (2) anchovy, (3) procession,
(4) bivouac, (5) acolyte.

Directions, lowest level:
“Make a ring like this,” showing act.

Directions, one of highest levels:
A rather long paragraph, entitled “The American State,” is read.
The following is one of the questions asked: “To what may we
attribute the similarity between the plans of certain cities and the
arrangement of the States?”

Results obtained with the CAVD scale have been correlated with
those obtained by the same persons on other group scales and on
the Stanford-Binet. The coefficients obtained were high, and in some
instances very high. However, the time required to take the CAVD
scale is much longer than that required for the others with which it
correlates so highly. Due to the time factor, this scale is not so widely
employed as some of the others.


The Thorndike Intelligence Examination for High-School Graduates.
This test is extremely long (requiring about three hours) and
laborious to score, as well as expensive.²⁷ It includes tests of highly
specialized information, arithmetic, algebra, and paragraph meaning
(which constitutes more than two-thirds of the examination). The
information and algebra tests may be criticized as being dependent
upon highly specific school learning; and the tests of paragraph
interpretation may be criticized as being dependent upon a high-level
vocabulary which requires unusual educational opportunities, either
formal or informal. Thorndike, however, intends that the examination
shall be used with “candidates who have had good educational
advantages and who know English as a mother tongue.” The presumption,
then, is that, having had these advantages, prospective college
students will manifest intellectual competence and promise in terms of
their abilities to deal with such mental tasks as are presented in
this examination.

²⁷ By E. L. Thorndike. Published by Bureau of Publications, Teachers
College, Columbia University; most recent series, 1931-1936.


Henmon-Nelson Tests of Mental Ability. Though one of the older
instruments, this test is still rather widely used.²⁸ Standardized at
three educational and age levels (grades 3-8, 7-12, and 12-16), the
items are arranged in “spiral omnibus” form. These scales, intended
to measure general scholastic intelligence, are weighted with verbal
materials. Of secondary importance (in terms of numbers) are the
quantitative items; and of distinctly minor significance are the
nonverbal (spatial relations) items, there being, for example, only 10
of these in 90 at the level of grades 7-12. The verbal and numerical
items are of the familiar variety: [...] scrambled sentences, number
sequ[ences ...]. The Henmon-Nelson scales r[...] of general ability,
[...] not under language [...]ales upon which a [...] [t]hey are now in
need of [...] “spiral omnibus” test is arranged.

1. Which word does not belong with the others?
(1) Ida, (2) Paul, (3) Lucy, (4) Janet, (5) Edith

2. Better is to good as worse is to:
(1) very good, (2) medium, (3) bad, (4) much worse, (5) best

3. 1, 6, [...] 16, ..... What two numbers should be on the
dotted line?
(1) 21 and 26; (2) 17 and 25; (3) 26 and 29; (4) 22 and 27;
(5) 20 and 25.

4. It was raining too hard to .... out. A word for the blank is:
(1) comment, (2) gather, (3) venture, (4) summon, (5) render.

²⁸ By V. A. C. Henmon and M. J. Nelson. Published by Houghton Mifflin,
1931-1950.



The Pintner Tests. Another series is that of which Pintner was the 
senior author.²⁹ The Pintner-Cunningham Primary Test (1946)
covers the range from kindergarten through the first half of grade 2. 
The Pintner-Durost Elementary Test (1940) is devised for the last 
half of grade 2 through the first half of grade 4. The Pintner Inter- 
mediate Test (1938) is for the latter half of grade 4 through the first 
half of grade 9. The Pintner Advanced Test (1938-1939) begins 
with the ninth grade and continues through adult levels. 


The Otis Group Intelligence Scale (1919). This test provides two 
examinations.³⁰ The Primary Examination is designed for the range
extending from kindergarten through grade 4. The Advanced Exami- 
nation extends from the level of grade 5 through grade 12. Otis has 
also devised several other forms of his tests, which differ from the 
foregoing chiefly in respect to their mechanical features; some are de- 
signed to be “self-administering,” and others to facilitate “quick scor- 
ing.” 

There are other group scales which are as well constructed, as 
valuable, and as widely used as some described in this chapter. It has 
not been our purpose to present a complete listing and description of 
group scales; it is unnecessary to do so, since those that have been 
described herein contain all the essential features of current group 
tests. It has been our purpose, rather, to familiarize the reader with
the organization and content of group scales, their statistical merit,
their scoring techniques, and with the major similarities and
differences that exist among them.


EVALUATION OF GROUP SCALES 

Comparison with Individual Scales. Group scales were developed
to permit the testing of large numbers of persons at one time. On
the whole, therefore, they are not so useful as are individual scales
(e.g., Stanford-Binet and Bellevue) in studying an individual case. For
when a group scale is used, it is not possible to observe a person’s
approach to the solution of problems, nor his behavior under success
and failure. Nor is it possible to evaluate the qualitative characteristics
of his responses, since group scales are scored quite rigidly.
Furthermore, it is difficult, in fact practically impossible, to know
whether an individual is exerting his maximum effort when taking a
group examination. Thus when a group scale is given, it is possible to
[...] the test results only in terms of numerical indexes (plus
profiles, sometimes), whereas during an individual examination the
psychologist is able to make behavioral and qualitative observations
of considerable [...]. [...] all group scales, below college level, have
been validated against individual scales, especially the
Stanford-Binet, as one of the principal criteria. This fact in itself
is a recognition of the merit of the individual scale, the quality of
which the group scale is trying to approach as closely as possible.
Other criteria of validity are the familiar ones discussed in earlier
chapters.

²⁹ Published by World Book Co.
³⁰ Published by World Book Co.

In discussing the definitions and analyses of intelligence, we stated
that one deficiency of all tests is that they do not measure the creative
aspects of intelligence; nor do they directly measure the insights that
come from experience (“wisdom,” “judgment”), or productive thinking,
or the intellectual originality of an individual. This deficiency is
more marked in group than in individual scales because of the rigidity
of scoring the former.


Theoretical and Statistical Bases. Most group scales are based,
implicitly at least, upon the “general factor” theory of intelligence;
for most of them undertake to sample a person’s mental activities by
means of several kinds of tasks and then to rate the individual by
means of a single index. A few scales are based upon the group-factor
theory.


Since many group tests, of varying quality, have been published,
it is essential that prospective users examine the manuals closely to
determine which of these satisfy the standards that should be
demanded of them. The reader is already familiar with the standards
and methods of establishing reliability and validity. These should be
rigorously applied to group tests. In this connection it is essential
that the manual state which method was used in determining
reliability, especially if the speed factor seems to be a significant
one.

Since group tests for children and adolescents are used primarily
to assist in dealing with educational problems, it is essential that the
scale’s predictive efficiency, with regard to school work and progress,
be reported as one criterion of validity.



Scores from a scale as a whole are more reliable and more valid
than subtest scores. A distinction should be made, therefore, between
subtest reliability and validity, on the one hand, and total scale
reliability and validity, on the other. This distinction is especially
pertinent when a scale’s subtest scores are to be used for
differentiating and diagnostic purposes.
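
Why a total score tends to be more reliable than a subtest score can be illustrated with the general Spearman-Brown formula, which relates reliability to test length. The sketch below rests on assumptions not made in the text: the subtests are treated as parallel and of equal length (real subtests rarely are), the scale is imagined to have four of them, the total reliability of .95 is used only as an example, and both function names are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Reliability expected when a test is made length_factor times as long.

    A length_factor below 1 shortens the test (e.g. 0.25 for one of four
    equal, parallel subtests).
    """
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


def length_factor_needed(reliability: float, target: float) -> float:
    """How many times longer a test must be to reach the target reliability."""
    return (target * (1.0 - reliability)) / (reliability * (1.0 - target))


# ASSUMED example: a total scale with reliability .95, made of 4 parallel subtests.
subtest = spearman_brown(0.95, 0.25)
print(round(subtest, 2))                              # 0.83
print(round(length_factor_needed(subtest, 0.95), 1))  # 4.0
```

Under these assumptions a single subtest falls to about .83, which is why subtest scores used for differentiating and diagnostic purposes call for separate reliability figures.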
The manual should give not only the size of the standardization
population sample, but also the characteristics of that sample:
namely, geographic and socio-economic distributions, range of ages,
range of ability levels, range of school levels, and sex distribution.



Criteria of Evaluation. In evaluating a group scale with a view to its 
possible usefulness in a given practical situation or in the solution of 
a theoretical problem, it is customary to use the following criteria: 


It must be sufficiently valid and reliable.

The range of norms must be adequate for the group for which the
scale is devised.

The item-difficulty in each subtest must be of sufficient range to
differentiate between the various levels of ability. Individuals at the
lowest and highest levels should be able to obtain representative
scores.

In general, the range of ability to be tested (ages and school
grades) should be restricted rather than all-inclusive. By restricting
the range, a given number of items and a given length of time can be
used for a more thorough and accurate examination than if a scale
of the same length were employed to cover a wider range. In the
latter instance, the test items would have to be spread more thinly.

Length of the scale must be adequate. In time required, scales vary
from about one-half hour to three hours, depending upon levels for
which they are intended. The great majority of scales require one
and one-half hours or less. Increase in length, to an optimal point,
adds to the validity and reliability of a scale; for errors of
measurement are decreased (better sampling) as length is increased to
an optimal point. Judging from current practices, based upon
experiment, optimal lengths appear to be about a half-hour at the level
of kindergarten and primary grades, about forty-five minutes at the
level of elementary grades, and up to about an hour or an hour and a
half at higher levels.

Simplicity of responses is frequently regarded as an asset in group
tests. For some purposes (when group trends are sought, rather than
individual performance) this is an asset simply because scoring is
facilitated. But, as already pointed out, such simplicity and
consequent rigidity may limit the value of tests when evaluation of an
individual’s responses is desired.

Simplicity of scoring is also frequently considered to be an asset,
since it is actually a result of simplicity of responses. The same
comments apply here as above.

Ease of administering a group scale is desirable. Frequently, group
scales have to be given by relatively inexperienced persons; it should,
therefore, be possible to train them in a brief time to administer the
scale accurately and with precision. Also, simplicity of instructions
and procedures in giving an examination to a group reduces the
possibility of confusion and misunderstanding on the part of
individuals in the group.

The examiner's manual should be clear and complete in respect to
standardization procedures and results, nature of the content,
directions for administering and scoring, norms, and interpretation of
results.

The content of the tests should be interesting to the groups for
whom the scale is intended.

The content of the tests should be appropriate to the subjects
being examined. That is to say, the psychologist must determine
whether or not, in a given instance, it is desirable to use a scale
which is entirely verbal, or entirely nonverbal, or verbal and
quantitative, or mixed. His choice of scale will depend upon who are
the subjects to be tested and the purpose for which the test is being
given.


USES OF GROUP SCALES 


[...] [wi]sh at this point only to mention [...] [hav]e been put.
[...] purposes of general survey, ability classification of pupils,
[...] [gu]idance. Under general survey, [...]ing: range and
distribution of [...]lity; [...]ng of ability; differences between
pupils in various schools within the same community; differences
between pupils in different school [...] town, and country children.

In classifying pupils according to ability level for the purpose of
differentiated instruction, a test of mental ability is, of course,
basic, though it should not be the only criterion.



Since relatively very few schools include a qualified psychological 
examiner on their staffs, and since extensive individual examination 
is costly and time-consuming, group scales are being used for most 
guidance purposes. However, in view of the fact that group-test ratings 
may indicate only the approximate level of an individual’s mental 
ability, they must be used in conjunction with other available evidence 
obtained from school records, teachers’ reports, objective achieve- 
ment tests, and interviews. But there is no doubt that psychological 
test ratings, correctly obtained and interpreted, tell us much more 
about a pupil’s mental alertness and organization of abilities than 
could be ascertained without their use. 

Group tests have been applied extensively to a large number of 
theoretical and practical problems of psychological, educational, and 
sociological significance, such as: individual differences in relation 
to sex, racial, and national membership; mental levels and character- 
istics of special groups, such as the mentally deficient, the gifted, and 
the delinquent; employee selection for jobs requiring different levels 
of ability; family similarities and the inheritance of intelligence; effects 
of changed environment upon mental level; the nature and course of 
mental development; the nature and organization of intelligence; 
constancy of the IQ and prediction of ability; and problems of theory 
and technique, such as the relationship between “speed” and “power” 
as aspects of intelligence. Then, of course, there was the vast use of 
group scales in the armed forces, during World War II, for “screen- 
ing” and classification of enlisted and commissioned personnel. 

The foregoing enumeration is not exhaustive; but it suffices to show
the wide range of application of group tests of mental ability; and it
explains why tests are under continual scrutiny in an effort to increase
their validity and reliability.


12


APTITUDE TESTS: 
MECHANICAL AND CLERICAL 


DEFINITION AND EXPLANATION 


An aptitude is a condition or combination of characteristics
indicative of an individual's ability to acquire with training some
specific knowledge, skill, or set of responses, such as the ability to
speak a language, to become a musician, to do mechanical work, etc.
An aptitude test, therefore, is a device designed to indicate a person's
potential ability for performance of a certain type of activity of a
specialized kind and within a restricted range.

Aptitude tests are to be distinguished from those of general ability,
such as we presented in earlier chapters, and also from tests of skill
or proficiency acquired after training or experience. They should be
distinguished, too, from educational achievement tests which are
designed to measure an individual's quantity and quality of learning
in a specified subject of study after a period of instruction.

The reader should note that aptitude is differentiated from skill
and proficiency. Skill means the ability [...] ease and precision.
Proficiency has muc[h ...] that it is more comprehensive: [...]tain
types of motor and [...]; in other types of activities as shown by
[...] bookkeeping, history, economics, m[...] [pr]oficiency in any
type of performance. [...] aptitude for a given type of activity, we
mean the capacity to acquire proficiency under appropriate conditions;
that is, his potentialities at present, as revealed by his performance
on selected tests which have predictive value.

Furthermore, when we speak of a person’s aptitude for a specified
activity, we do not make any assumptions regarding the degree to
which they depend upon innateness or acquisition. In giving an
aptitude test to a person we desire to obtain a measure of his promise
or essential teachability in a given area. While we make no assumptions
regarding the roles of nature and nurture in this matter, we, as
clinicians or guidance counselors, cannot ignore that person’s past
experience in evaluating his performance on aptitude tests. For
example, one method of measuring mechanical aptitude is by means of a
mechanical assembly test, utilizing various common objects such as a
bicycle bell and a door lock. It is inconceivable that a boy who in
the past has had opportunity to manipulate such objects will not
achieve a higher score than if he had not had such experience. Testing
instruments measuring engineering aptitude include, for example,
tests of simple mathematical relationships, scientific vocabulary,
common scientific principles, and problems of practical mechanical
insight. Here again, an individual's performance will be influenced
by his previous experience. This aspect of aptitude testing and
interpretation will become clearer as the reader becomes acquainted
with the nature and content of aptitude tests.

The principles underlying aptitude tests are the same as those
employed with tests of intelligence in respect to sampling of
performance, population samples, and standardization techniques
(including reliability and validity). Therefore, we shall not present
the several aptitude tests in statistical detail. It will be our
purpose, rather, to describe the kinds of activities or functions most
commonly examined by available tests of this type.


TESTS OF VISION AND HEARING


Quite aside from the general desirability of good vision and
hearing, there are numerous occupations and forms of learning in
which one or both are essential at a high level; thus, they are aspects
of certain aptitudes. Sensory deficiencies, furthermore, may adversely
affect an individual’s achievements in schoolwork or in his social and
emotional adjustment. Hence, in some cases they might play a
significant part in clinical work and in vocational and [...]
guidance. Tests are available for visual acuity, color vision, and
auditory acuity.


Color Vision Tests. All such tests depend upon the principle that
color-deficients confuse certain groups of hues, inter se, while a
normal person distinguishes them. Thus, one set of charts is so devised
that persons with unimpaired color vision should see certain bars, or
arms, radiating from the centers of the circles. In one of the circles, for
example, a person having unimpaired color vision will see two
radiating arms: one green and one red. A red-blind eye will see only the
[...] red-green blind eye [...]vised that one with [...] a condition
which is al[...] [exam]ple, in the Navy, during World War II, about
fifty percent of all color-deficients remained undetected after [...]
is now complete on a color vision [...] stable despite variations in
illumination.¹

¹ E. Freeman, “An Illuminant-Stable Color Vision Test,” Journal of the
Optical Society of America, Vol. 38, 1948, pp. 532-538. This test is
distributed by The Psychological Corporation.

Tests of Visual Acuity. These have been taken by nearly everyone,
the most familiar being the crude Snellen Chart. On this chart are
printed rows of letters, varying in si[ze ...]. Each row and size has
bee[n ...] distance by the “normal [...] the numerator is the
dist[ance ...] [usu]ally 20 feet), and the [...] value” of the
[...]ted. “Distance [...]y feet is read by the “normal eye” at forty
feet [...] (in that eye) is given as 20/40. [...] Chart, though still
used, is a very inadequate test; for it will detect only the myopes,
but not the hyperopes and presbyopes, nor those heavily handicapped by
muscle imbalance. For some years, the only available device for more
thorough nonclinical testing of near and distance visual functions was
the Keystone “Telebinocular,” the accuracy and dependability of which
have been seriously questioned. Just recently, however, a new, much
more efficient device, known as the Protometer, has been developed. It
is administerable by nonclinical persons for testing both near and
distance visual functions (acuity, muscle balance, depth perception,
and color).²
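
The Snellen notation mentioned above (for example, 20/40) can be read as a simple ratio of the testing distance to the distance at which the “normal eye” reads the same row. The sketch below converts that notation to a decimal value. It is offered only as an illustration: the function name is hypothetical, and expressing acuity as a decimal is a common convention assumed here, not something stated in the text.

```python
def snellen_to_decimal(notation: str) -> float:
    """Convert a Snellen fraction such as '20/40' to a decimal ratio.

    The numerator is the testing distance (usually 20 feet); the denominator
    is the distance at which the "normal eye" reads the same row.
    """
    test_distance, normal_distance = (float(part) for part in notation.split("/"))
    if test_distance <= 0 or normal_distance <= 0:
        raise ValueError("distances must be positive")
    return test_distance / normal_distance


print(snellen_to_decimal("20/40"))  # 0.5
print(snellen_to_decimal("20/20"))  # 1.0
```

A value below 1.0 means the row had to be larger than the one the “normal eye” reads at the testing distance. As the text notes, such a figure says nothing about hyperopia, presbyopia, or muscle imbalance.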


FIG. 12.1. The Protometer, designed and developed by Ellis Freeman.
(Patented.) (By permission.)

The Protometer is designed for rapid and comprehensive testing of
vision where numerous individuals are involved, as in schools, colleges,
industries, and the armed forces. The Protometer gives, among other
data, monocular acuity, binocular acuity, and muscle balance, both for
distance and for near, all under proper conditions of illumination and
viewing maintained at a constant level. The Protometer discloses cases of
serious impairment of vision where the trouble is not with acuity but with
the failure of the two eyes to work in coordination.

The speed of operation of the Protometer is due to the fact that it is
basically two Brewster stereoscopes: the optical system of the one for
distance over the optical system of the one for near. As long as the
subject is taking the distance test, it alone is illuminated. When the
target for the near test is to be viewed, the target light for the
distance test is automatically extinguished while the target light for
the near test is automatically turned on. The tester operates only one
control knob, which presents in sequence the test series for both
distance and near and at the same time controls the illumination. The
second knob is merely for adjusting the eye-height of the ocular to the
subject, when this is occasionally necessary; the entire instrument is
raised or lowered on a rack or pinion.

The Protometer, weighing 10 pounds, is easily portable and can be used
wherever it may be plugged into current. The operator is a teacher, nurse,
clerk, or other nonprofessional person who is taught to operate the
instrument, read the questions from the record card, and enter
responses. The record card itself informs the operator of the results,
which may indicate either satisfactory vision or deficient vision
requiring referral for professional examination. The Protometer test is
administered within two mi[nutes ...] of the Protometer consists of the
basic [...] professional refraction and has the same high precision.
For this reason it is extremely dependable as a preliminary diagnostic
device. It is not equipped with means for complete diagnosis and
prescription of correction. For these latter a full-scale professional
referral is necessary.

² Distributed by Freeman Technical Associates, 1206 Benj. Franklin Dr.,
Sarasota, Fla.


Since examinations of men in the last war have shown that almost
ten percent of all males are color-deficient in some degree, it seems
desirable to test all school children very early for this function.
Deficients could, for example, be diverted from trying to become
artists, geologists, clothes designers, etc. It is wise, also, to use a
dependable color test in personnel divisions of department stores and
of some industries; in the former to avoid placing color-deficient
sales persons in the wrong departments; in the latter to avoid placing
a colorblind individual on, say, radio or other wiring which requires
discrimination of color code. For a job in which some color deficiency
can be tolerated, the demands of the job in regard to color
discrimination should be determined, as well as the candidate’s degree
of deficiency.

Good vision as such is desirable without question. Hence, in
schools and industry there should be screening by means of a reliable
device to find those who need correction or whose color deficiencies
need to be given consideration in education, [...]



Auditory Acuity. [...] [me]asured (1) by me[ans ...] [c]onsonants and
[...] percentage [...] [“norm]al” percentage. The [...] which are
superior to whispered speech; for they are of measured intensity and
unaffected by acoustics of the room in which the test is being given.
The device consists of a disc and phonograph with magnetic reproducer
that presents numbers of two or three digits each. The subjects hear
the numbers through headphones and write them down, each ear being
tested separately. Successive numbers are reprod[uced ...] [in]tensity,
at small uniform steps, until a minimum i[...] “normal ear,” is
reached. The cycl[e ...] are spoken in a male voice and four in a
female’s. There are tests, also, of sentences, words, and pure tones.

It is obvious that these tests of vision and hearing do not measure 
a person’s aptitude for specific types of learning and activity. For 
certain kinds of learning and activity, however, a given degree of 
visual or auditory acuity is essential. In that sense, then, these devices
may constitute a part of a battery of tests which, taken together,
are used to measure a particular aptitude.


MOTOR AND MANUAL TESTS 

Tests of Strength of Grip. One of the oldest instruments for 
the measurement of individual differences in the psychological lab- 
oratory is the hand dynamometer for measuring strength of grip. The 
instrument consists of an inner and an outer handle, a dial, and a 
pointer. The subject grips these handles so that the second phalanges
of the fingers press against the inner handle, while the outer handle
presses against the heel of the hand. The subject then squeezes as
hard as possible. Strength of grip is measured in kilograms. After
many experiments, it appears that in psychological work this instrument
is useful principally as one device for determining degree of
handedness and rate of fatigue. Since these two traits are involved in
certain activities and occupations, they are relevant in some aspects
of aptitude testing.


Tests of Reaction Time. Reaction time is the time interval between 
the onset of a stimulus and the beginning of the person’s overt in- 
tentional response. The particular stimulus and response to it are 
prearranged in an experimental situation. For example, the subject
may be instructed to tap a telegraph key immediately upon per- 
ceiving a red light, the elapsed time between stimulus and response 
being electrically recorded in terms of thousandths of a second. It is 
possible to devise a variety of tests, their particular character depend- 
ing upon which sensory and motor functions are to be measured. 
This type of test obviously is intended to measure speed of response 
in situations demanding immediate reaction, as in certain machine
operations and in driving an automobile.
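
The measurement just described can be sketched in a few lines of Python. This is only an illustration: the function name, the data layout, and the sample timings are hypothetical and are not taken from any test manual. Each trial is assumed to be recorded as a stimulus-onset time and a response-onset time in seconds, and the result is expressed in thousandths of a second.

```python
from statistics import mean


def reaction_times_ms(trials):
    """Return the reaction time of each trial in thousandths of a second.

    Each trial is a (stimulus_onset, response_onset) pair in seconds.
    """
    times = []
    for stimulus_onset, response_onset in trials:
        if response_onset < stimulus_onset:
            # A response before the stimulus is an anticipation, not a reaction.
            raise ValueError("response recorded before stimulus")
        times.append(round((response_onset - stimulus_onset) * 1000))
    return times


# Hypothetical timings for three trials (seconds from start of session).
trials = [(1.000, 1.215), (4.500, 4.690), (9.250, 9.480)]
rts = reaction_times_ms(trials)
print("reaction times (ms):", rts)
print("mean (ms):", round(mean(rts), 1))
```

Rejecting a response that precedes the stimulus is a design choice of this sketch; an actual apparatus might instead discard or repeat such a trial.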

Tests of Manual Dexterity. In order to achieve competence in activities
requiring manual dexterity, speed of gross movements of hand
and arm, manual rhythm and coordination, and finger control and


Aptitude Tests: Mechanical and Clerical 
312 


coordination [...] in varying degrees. For each of these
[...] which vary in detail but are
[...] alike. Gross movements of hand and arm may be
measured in terms of speed with which the subject picks up and places
cylindrical blocks in holes in a board. Finger dexterity and coordination,
necessary in rapid and accurate manipulation of objects, may
be tested by measuring the rate at which an individual, with fingers 
or tweezers, is able to pick up small cylindrical metal pins or wooden 
pegs, of different shapes, and place them in the holes of a tray. (See 
Fig. 12.2.) Hand precision is measured by the accuracy with which 


Fig. 12.2. The plier dexterity test
shown here is useful in evaluating 
skill in the use of small tools and, 
in general, in evaluating aptitudes 
involving finger dexterity. The tray
contains metal pegs which must be 
placed in the small holes in a pre- 
scribed order. The score is based 
upon time required to complete 
the task. Sometimes the time re- 
quired to remove the pegs is also 
included in the score. (Acme 
Photo.) 


a metal stylus can be placed into holes of small diameter cut in metal 
and electrically connected. Contacts of the stylus with rims of holes
are electrically recorded and constitute the measure of inaccuracy.
Occasionally, also, a paper-and-pencil test includes tasks designed
to measure hand precision, such as speed and accuracy of tracing a
path, speed of tapping, and placing a prescribed number of dots
within a small circle.*


Other tests of manual dexterity follow the same general form, but
are more complex. For example, the Small Parts Dexterity Test *
consists of a metal plate [...]

* For example, the MacQuarrie Test for Mechanical Ability. Published by
California Test Bureau, 1943.

* By J. E. and D. M. Crawford. The Psychological Corporation, 1949.




purpose of this test is to measure a combination of perception and 
dexterity, in terms of rate of performance. 

The Stromberg Dexterity Test* also is a device the purpose of 
which is more complex than the simple measurement of manual 
dexterity. It consists of a tri-colored formboard (6 rows and 9 col- 
umns), into which flat, cylindrical disks, variously colored, are to be 
placed by the subject in a prescribed order. It appears that this test 
involves not only manual dexterity, but also gross color perception 
and a rather elementary level of nonverbal classification. 

Since manual dexterity scores on tests of this type are affected, 
in varying degrees, by the subject's lateral dominance (the preferred 
use and superior performance of one side of the body or the other), 
it is often desirable to use tests of hand and eye dominance. This 
procedure is particularly indicated if we are concerned primarily with
analyzing and understanding the person being tested, rather than 
with making selections from among candidates for a particular job.° 

Coordination and rhythm of hand movements have been tested 
by means of a card-sorting test in which the subject uses one hand at 
a time or both hands together in dropping playing cards through slots. 
A more recent device is the two-hand coordination test in which the 
individual attempts to move both handles of a mechanism simultane- 
ously in such a way as to keep an upper disk over the lower one, 
which moves in an unpredictable manner.’ Another two-handle device 
is employed in testing a subject’s ability to follow an irregular path 
without touching the sides.

During World War II, numerous psychomotor tests were used by 
army and navy psychologists to assist in the selection of men for 
specific types of training, especially in the air forces.’ These tests involved
more difficult operations than those described above, often
requiring rapid and complex sensory-motor coordination, such as the 


* By E. L. Stromberg. The Psychological Corporation, 1951.

See, for example, A. J. Harris, Tests of Lateral Dominance; W. R. Miles,
The A-B-C Vision Test. The Psychological Corporation.

[...] “The Selection of Pilots by Means of Psychomotor Tests,”
Journal [...], Vol. 15, 1944, pp. 116-123.

[...] see G. K. Bennett and R. M. Cruikshank, A Summary
of Manual and Mechanical Ability Tests, New York: The Psychological
Corporation, 1942.

Staffs, Psychological Research Unit No. 2, and Department of Psychology,
School of Aviation Medicine, “Research Program on Psychomotor Tests in the
Army Air Forces,” Psychological Bulletin, Vol. 41, 1944, pp. 307-321.




following: [...] hands simultaneously in manipulating two
[...] a target which moves in an irregular path;
[...] patterns of lights by manipulating stick and [...] a
simulated airplane cockpit; reacting to four different relative positions
of a red light and a green light by pushing one of four switches arranged
in a square pattern before the subject; moving a wheel, resembling
an airplane control, in and out of its shaft in order to hold
[...] controlled by one lever and the vertical by another.

Fig. 12.3. The Stromberg Dexterity Test. The Psychological Corporation.


The tests of sensory capacity and those of motor and manual dexterity
developed prior to and during World War II [...]




yielded coefficients which are very low, some being so low as to be 
negligible. In fact, on occasion a low negative coefficient has been 
found. It has been concluded, therefore, that these two types of psy- 
chological instruments measure functions which are largely inde- 
pendent of each other. 




Fig. 12.4. The Discrimination Reaction Time Test.
This test was designed to measure how quickly in- 
dividuals make differential manual responses to visual 
stimulus patterns differing from one another with 
respect to the spatial arrangement of their compo- 
nent parts. The test requires that the candidate react 
by pushing one of four toggle switches in response 
to the lighting of a red and a green signal lamp. The 
position of the red lamp with respect to the green 
determines which of the four switches should be
pushed. A [...] view of a single test unit, with designations
of lights and switches, is shown in the figure. The 
four stimulus lamps, two red and two green (L1, L2, 
L3, L4), are arranged in the form of a square on the 
vertical panel facing the candidate. The stimulus to 
which the candidate must react by operating one of 
the four toggle switches is the simultaneous lighting 
of one of the red lights and one of the green lights. 




When the candidate operates the correct switch, the white signal
lamp (which lights on every trial) is extinguished
immediately, signaling the candidate that he has
made the correct response. The colored lights do not
go out until they have been on for 3 seconds, regardless
of how quickly the correct switch has been
pushed.

The four spring-return toggle switches (S1, S2, S3,
S4) are so set that the candidate must push each one 
in a different direction. The four directions of move- 
ment correspond to the four signal patterns formed 
by the lighting of the red and green lamps. Thus, if 
L1 and L4 are lighted, the red is “up” with respect 
to green, and the upper switch, S1, must be pushed 
up. If L3 and L4 are lighted, the red is to the right 
of the green, so the switch on the right, S2, must be 
pushed to the right. The time taken to operate the 
correct switch on each of a series of test trials is ac- 


cumulated on an electric stop-clock and constitutes 
the candidate’s score.


(From Apparatus Tests, Report No. 4, Army Air 
Forces Aviation Psychology Program, edited by 
A. W. Melton. U. S. Government Printing Office, 
1947.) 
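
The rule relating lamp pattern to switch can be sketched as follows. This is an illustration under stated assumptions, not the apparatus specification. The caption gives only two cases (L1 with L4 means red is "up" and S1 is pushed; L3 with L4 means red is to the right and S2 is pushed). The lamp coordinates, the color and position of L2, and the assignment of S3 and S4 to "down" and "left" are assumptions chosen to be consistent with those two cases and with the statement that two red and two green lamps form a square.

```python
# Assumed positions (x, y) of the four lamps on the vertical panel.
# From the caption: L1 is red and directly above L4 (green); L3 is red and
# directly to the right of L4. L2 is inferred to be the second green lamp.
LAMPS = {
    "L1": ("red", (0, 1)),
    "L2": ("green", (1, 1)),
    "L3": ("red", (1, 0)),
    "L4": ("green", (0, 0)),
}

# S1 (up) and S2 (right) are given in the caption; S3 and S4 are assumed.
SWITCH_FOR_DIRECTION = {"up": "S1", "right": "S2", "down": "S3", "left": "S4"}

DIRECTION_FOR_OFFSET = {
    (0, 1): "up",
    (1, 0): "right",
    (0, -1): "down",
    (-1, 0): "left",
}


def correct_switch(lamp_a, lamp_b):
    """Return (position of red relative to green, switch to be pushed)."""
    by_color = {LAMPS[name][0]: LAMPS[name][1] for name in (lamp_a, lamp_b)}
    if set(by_color) != {"red", "green"}:
        raise ValueError("a stimulus is one red lamp plus one green lamp")
    (rx, ry), (gx, gy) = by_color["red"], by_color["green"]
    direction = DIRECTION_FOR_OFFSET[(rx - gx, ry - gy)]
    return direction, SWITCH_FOR_DIRECTION[direction]


def score(seconds_per_trial):
    """Accumulated time taken to operate the correct switch over all trials."""
    return sum(seconds_per_trial)


print(correct_switch("L1", "L4"))  # first example in the caption
print(correct_switch("L3", "L4"))  # second example in the caption
```

Under the assumed layout every red-green pair is adjacent, so exactly four stimulus patterns exist, one per switch.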


Fig. 12.5. The Two-Hand Coordination Test.
The Two-Hand Coordination Test was designed
to measure a candidate’s ability to coordinate the
movement of both hands. He is required to control

Tests of Mechanical Aptitude 317 


the movements of a target-follower in response to a 
visually perceived target moving at varying rates along 
an irregular pathway. 

A single test unit, as seen from the candidate’s 
position, is shown in the figure. Two handles which 
he manipulates are seen in the foreground and at the 
left. Rotation of the upper handle causes a contact 
point, which is mounted on the leaf of a micro- 
switch, to move toward the candidate with counter- 
clockwise rotation and away from the candidate with 
clockwise rotation. Rotation of the lower handle in 
a counterclockwise and clockwise direction causes the 
contact point to move to the left and right, respec- 
tively. Rotation of both handles simultaneously 
causes the contact point to move in any desired di- 
rection in the plane of movement of the target. A 
candidate’s task is to manipulate the controls in such 
a way as to keep the target-follower on top of a round
brass button (the target) as it moves along an irregu- 
lar clockwise path. When the contact point is on the 
target button, the microswitch is closed and current 
flows to an electric clock located on a remote control 
desk. The time which is accumulated on the clock 
during a series of eight 1-minute trials indicates the
efficiency of the candidate’s performance. 

(From Apparatus Tests, Report No. 4, Army Air 
Forces Aviation Psychology Program, edited by 
A. W. Melton. U. S. Government Printing Office,
1947.)


TESTS OF MECHANICAL APTITUDE 
The capacity designated by the term mechanical aptitude is
not a single, unitary function. It is a combination of sensory and
motor capacities such as those already briefly described, plus perception
of spatial relations, the capacity to acquire information about
mechanical matters, and the capacity to comprehend mechanical relationships.
Thus tests of mechanical aptitude are designed to measure
capacity and performance on a higher level of organization than
are those of sensory-motor capacity and dexterity.

The Assembly Test of General Mechanical Ability,¹⁰ devised by
J. L. Stenquist (1923), the first of its kind and now of little more
than historical interest, was intended to measure a person’s ability


¹⁰ C. H. Stoelting Co., Chicago.




Fig. 12.6. Bi-Manual Planned Pursuit Test.
The Bi-Manual Planned Pursuit Test was designed
and developed to measure ability to coordinate the
activities of both hands by a systematic shifting of
attention. The test consists of an irregular polished
brass pathway which moves beneath two pointers.
The pointers are separated by a distance of 8 inches.
The pointers are adjustable by the candidate by
means of two vertical handles, the candidate being
required to keep the pointers (one with each hand)
in contact with the moving pathway. In view of the
fact that a limited amount of the pathway is visible
prior to reaching the contact pointers, it was believed
that a certain amount of planning could occur and
would in part determine the score on the test. The
test consists of six 15-minute trials. Rest periods of
unspecified duration, probably about 30 seconds, occurred
between trials. The score is the length of time
during which both pointers are on the pathways.
(From Apparatus Tests, Report No. 4, Army Air
Forces Aviation Psychology Program, edited by
A. W. Melton. U. S. Government Printing Office,
1947.)


to put together the parts of mechanical devices, among them a bicycle
bell, a double-action hinge, a door lock, and a mousetrap. This test,
consisting of three series, is [...] with individuals covering
the age range from children in the lower [...]hood.




Fig. 12.7. Triform Pegboard Test.


Each test apparatus consists of two pegboards, 11.25 inches
wide and 28 inches long, each containing 48 holes in 12 columns
of 4 holes each, with a depth of [...] inch. Figure 12.7 is a photograph
of one of these peg boards. There are 16 holes of each of
three shapes: round, square, and triangular. Corresponding to these
holes are 16 pegs of each shape. The 16 round pegs are painted
red and have a diameter of [...] inch; the 16 square pegs are painted
yellow and have sides [...] inch in length; the 16 triangular pegs are
painted blue and have [...] inch sides. All pegs have the same length,
which is 1 and [...] inches. In one of the peg boards the various-shaped
holes are scattered randomly throughout the board, while
on the other board they are grouped in three sections; the round 
holes on the left third of the board, which is painted red; the 
square holes in the center section, which is painted yellow; and the 
triangular holes in the right section, which is painted blue. 

The boards are placed in front of the candidate. The board in which
the holes are in irregular order contains the pegs and is placed
above the board in which the holes are grouped by shape. The task
of the candidate is to transfer the pegs in a standard order from
upper board to lower board.

The test is designed for administration in a 15-minute test pe- 
riod, one minute of which is required for preliminary instructions. 
There are six 40-second trials, with 60-second intervals between 
trials for replacing the pegs in their original positions and for rest.
The first three trials are performed with the right hand, and the 
last three trials with the left hand. There is no practice trial. Scores 
are recorded for each trial. The score is the number of pegs placed 
in the bottom board during the 40-second period. No credit is 
given for a peg in the candidate's hand at the signal “Stop.” Credit 
is given for all pegs placed even though in improper order. 

(From Apparatus Tests, Report No. 4, Army Air Forces Aviation
Psychology Program, edited by A. W. Melton. U. S. Government
Printing Office, 1947.)
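
The timing and scoring rules in this caption lend themselves to a short arithmetic check. The Python sketch below is an illustration only. It assumes five rest intervals (one between each pair of consecutive trials), and the per-hand totals are a convenience added here; the caption says only that scores are recorded for each trial. The sample peg counts are hypothetical.

```python
INSTRUCTION_SECONDS = 60   # one minute of preliminary instructions
TRIAL_SECONDS = 40         # six 40-second trials
INTERVAL_SECONDS = 60      # 60-second intervals between trials
N_TRIALS = 6
MAX_PEGS = 48              # 16 pegs of each of three shapes


def scheduled_seconds(intervals=N_TRIALS - 1):
    """Time accounted for by instructions, trials, and rest intervals."""
    return INSTRUCTION_SECONDS + N_TRIALS * TRIAL_SECONDS + intervals * INTERVAL_SECONDS


def summarize(pegs_per_trial):
    """Per-trial scores plus (assumed) totals for each hand.

    A peg still in the hand at "Stop" is simply not counted, and pegs
    placed out of order are counted, so the score is the number placed.
    """
    if len(pegs_per_trial) != N_TRIALS:
        raise ValueError("expected one score for each of the six trials")
    if any(not 0 <= pegs <= MAX_PEGS for pegs in pegs_per_trial):
        raise ValueError("a trial score must lie between 0 and 48")
    return {
        "trials": list(pegs_per_trial),
        "right_hand": sum(pegs_per_trial[:3]),  # first three trials
        "left_hand": sum(pegs_per_trial[3:]),   # last three trials
    }


print(scheduled_seconds() / 60, "minutes scheduled out of 15")
print(summarize([20, 22, 23, 18, 19, 21]))  # hypothetical candidate
```

With five intervals the listed activities account for 10 minutes, which leaves a margin within the 15-minute test period.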


The Stenquist tests have been revised and extended at the University
of Minnesota (1930) and are known as the Minnesota Me[...]
In principle, these are essentially the same
[...] mechanical devices having been
[...] new ones added. Performance on these tests, scored
[...] and accuracy of work, has been found useful in
[...] of junior high-school boys in shop courses. Also,
facility in these assembly tests has been found by some investigators
to be one significant indication of a person’s aptitude for a number
of occupations such as machinist and auto mechanic.


Fig. 12.8. Minnesota Spatial Relations Tests. The
upper and lower [...] represent the formboards
which are filled with the pieces represented in the
middle part. Educational Test Bureau. (By permission.)


The Minnesota Spatial Relations Test (1930) consists of a series
of four boards, each of which has 58 cutouts of various shapes, many
of them unusual. The subject’s task is to replace these in their correct
holes in the board. Evidence indicates that persons engaged in
mechanical occupations tend, as a group, to earn higher scores than
do persons in nonmechanical occupations. This fact, it appears, is a
principal justification for use of the test as a measure of mechanical
aptitude. Some critics of the test have concluded that it is adequate

¹¹ D. G. Paterson, et al., Minnesota Mechanical Ability Tests, Minneapolis:
University of Minnesota Press, 1930.
¹² Marietta Apparatus Co., Marietta, Ohio.




as a measure of speed and accuracy in responding to details of 
spatial relations and that it yields a measure of an individual’s capacity 
to work with a variety of details in handling objects and concrete ma- 
terials. On the other hand, it is not adequate for measuring resource- 
fulness in solving problems of a mechanical nature, nor for measuring 
capacity to manipulate small objects with precision. 

The Revised Minnesota Paper Formboard (1948) is, as its name 
indicates, a test which reproduces in printed form the same type of 
problems as those presented by actual formboards."* In each problem, 


the subject is shown two or more parts of a geometric figure; when 
correctly assembled, the parts will make the complete figure. It is 
the subject’s task to identify the correctly assembled figure from 
among five choices. This test is designed, it appears, to measure one’s 
capacity to visualize and imaginally manipulate geometric forms. The 
data reported for it indicate that it has value for the prediction and
measurement of mechanical ability and for differentiating between
various grades of proficiency therein. Reported research has shown
this paper formboard test to have a moderately good correlation with
quality of mechanical performance and a moderate to low correlation
with success in mechanical drawing and descriptive geometry.
As a group, students in engineering and mechanical vocations obtain
higher scores than do other groups of students. Available evidence,
however, demonstrates that this test does not have high enough
predictive value to be used exclusive of other criteria and information.

A number of “pencil and paper” tests are designed to evaluate mechanical
aptitude by testing for specific mechanical information, selected
vocabulary, and ability to perceive and deal with practical mechanical
problems. One of the earliest of these is the Stenquist Mechanical
Aptitude Test (1921).¹⁴ One part consists of problems presented
by means of pictures. In each instance, the subject is required
to determine which of five pictures belongs with each of five others.
This part is, essentially, a test of the subject’s knowledge about mechanical
tools, objects, and devices, although there is some room for
the perception of relations and for reasoning. The second part of the
Stenquist test consists of some material similar to that in the first section
(that is, matching missing pieces with the correct mechanical
objects) plus questions applied to cuts of machines and machine
[...] underlying assumption of a test like the Stenquist must be
[...] about mechanical tools, objects, and devices reflects
mechanical interests and that mechanical interests are indicative, in
some degree, of mechanical aptitude.

¹³ Psychological Corporation, New York.
¹⁴ World Book Co., Yonkers, N. Y.


"First look at Problem 1. There are two 
parts in the upper left-hand corner. Now
look at the five figures labelled A, B, C,
D, E. You are to decide which figure shows
how these parts can fit together. Let us
first look at Figure A. You will notice that
Figure A does not look like the parts in
the upper left-hand would look when
fitted together. Neither do Figures B, C,
or D. Figure E does look like the parts in
the upper left-hand corner would look
when fitted together, so E is PRINTED in
the square above 1 at the top of the
page.”

Fig. 12.9. Specimen Item from Revised Minnesota Paper Form Board
Test. Psychological Corporation. (By permission.)


[...] difficulty, are designed [...] understanding of the operations
of physical and m[...] relatively simple situa[...]
[...]chool students, engi[...]
[...] relatively untrained and inexperienced persons. A second and
somewhat more difficult form is intended for use with
engineering-school candidates, applicants for technical courses or for
employment in mechanical jobs.
The third form was devised for use with high-school girls and women.
Since the types of items included are intended to be appropriate to
the level and experience of each group of examinees, many of the
items used for the women’s test come within their range of household
activities, involving objects and devices used in a home rather than
in a shop. Other items, non-household in character, are also utilized.

¹⁵ By G. K. Bennett et al., New York: The Psychological Corporation,
1940-1951.


Which table is more likely to break? 


Fig. 12.10. Specimen Item from Test of Mechanical
Comprehension by G. K. Bennett and D. F. Fry. 
Psychological Corporation. (By permission.) 


Unlike some other tests of mechanical comprehension, this one 
does not require specific knowledge, such as matching the parts of a
tool or some other mechanical object; nor does it require verbal
knowledge regarding tools, processes, or materials. The items in the
present test depict objects that are almost universal in American life,
such as airplanes, ladders, stairs, wheels, gears, pulleys, see-saws, and
others. The hypothesis is that answers to the problems presented do
not depend upon specific information or training but can, rather, be
arrived at by analysis of the nonverbal materials presented. The extent
to which this hypothesis is satisfied varies somewhat among the
60 items of each test. For example, there is no doubt that familiarity
with elementary physical principles or actual experience will be an
aid in answering questions involving pulleys and leverage. Yet, individuals
without such advantages, whose analytical ability is adequate,
are able to arrive at correct answers.

The Prognostic Test of Mechanical Abilities (1947) is intended
for guidance purposes with pupils in grades seven through twelve and
for screening purposes in industry. It is so devised as to provide a
profile of abilities that are common to many mechanical occupations.




It also provides a composite score. The content of the test was determined
by analyses of courses of study and by job analyses in mechanical
and related occupations.

The subtests are the following: arithmetic computation, from simple
addition through fairly simple fractions (no arithmetical-problem
solving); reading simple drawings and “blueprints”; identification and
use of common tools; spatial relationships (the paper form board);
checking measurements.¹⁶


[...] regard marks in high [...] educational groups ([...]
[...]tions with tests of general intelligence as criteria, then we can say
that some of the [...] have a fair degree of va[...]

[...] two ways: in part in terms of “f[...] part in terms of
correlations with instructors’ r[...]se for aviation
mechanics. In the first instance, the authors state that their device
measures the skills prescribed as necessary in eleven mechanical
occupations. In the second instance, the coefficients of contingency
go from .60 to .78, with a median of .67. These coefficients are among
the highest in this area of investigation, and considerably higher than
most others.

¹⁶ By J. W. Wrightstone and C. E. O’Toole. California Test
Bureau, Los Angeles.

Numerous studies have been published in which various tests of 
mechanical aptitude have been intercorrelated. The reported coeffi- 
cients are almost uniformly low (below .50) or very low. A few of 
the reported coefficients are moderate; that is, somewhat above .50. 
The reasons for these relatively low coefficients—unlike those found 
between the sounder tests of intelligence—are to be sought in the fol- 
lowing factors. (1) Some of the tests are much more comprehensive 
in scope than others that are relatively restricted and homogeneous in 
content; hence, the former measure a greater number of functions, 
some of which may have little communality with the latter. (2) Not 
all of these tests are calibrated for the same levels of difficulty; hence 
they do not have equal differentiating value at a given level. (3) 
Some of the tests are much more dependent upon experience and 
specialized information than are others. (4) Performance levels on 
several tests may reflect different degrees of interest and motivation 
in special areas. 
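
As an illustration of what "intercorrelating" several tests involves, the Python sketch below computes the product-moment correlation for every pair of tests and labels each coefficient using the rough boundary of .50 mentioned above. The test names and raw scores are hypothetical, invented only to make the example runnable, and the helper names are not taken from any published procedure.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on a test must vary")
    return sxy / sqrt(sxx * syy)


def intercorrelations(scores):
    """Correlate every pair of tests; scores maps test name -> score list."""
    names = list(scores)
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical raw scores of the same six examinees on three tests.
scores = {
    "assembly": [12, 15, 9, 20, 14, 17],
    "paper formboard": [30, 28, 35, 33, 25, 31],
    "mechanical information": [41, 50, 38, 47, 52, 40],
}
for (a, b), r in intercorrelations(scores).items():
    label = "low" if abs(r) < 0.50 else "moderate or higher"
    print(f"{a} x {b}: r = {r:+.2f} ({label})")
```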

In connection with (1), above, study of the content of tests of mechanical
aptitude shows that they sample, more or less, the following
functions: visual-motor integration, spatial visualization, perceptual
speed, manual dexterity, and visual insights (analysis). In addition
to these, some of the tests measure specialized information, knowledge
of techniques, arithmetical problem-solving ability, and technical vocabulary.
Some of the functions are measured by means of apparatus
tests (Figures 12.4, 12.5, 12.6), others by means of performance-type
materials (formboards, etc.), and still others by means of pencil
and paper tests. It is not surprising, therefore, that intercorrelations
between these tests are low, even though they fall within the same
category.

On the whole, tests of mechanical aptitude show very moderate or
only low correlations with actual job performance. This fact does not
necessarily signify that the tests themselves are defective, for many
non-mechanical factors enter into the job ratings received and into
actual performance on the job. These factors include subjective judgments
of the raters, the worker's health, motivation, and personality
traits which may facilitate or impede performance.

Although the foregoing factors might lower the correlation coefficients,
it is highly improbable that they are solely accountable for


the relatively low relationships found. It is essential, therefore, that a 
given test of mechanical aptitude be studied for each type of occupa- 
tion where it is to be used, in order to establish its validity not only 
in terms of correlation coefficients but—and more significant—in 
terms of critical cut-off scores and expectancy tables for each of several
levels of test performance. (See the discussion of these techniques
in Chapter 1.)
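
A minimal sketch of an expectancy table and a cut-off check is given below in Python. It is not a reproduction of the procedure discussed elsewhere in the book. The score bands, the cut-off of 20, the success criterion, and the ten records are all hypothetical, and the function names are illustrative. For each level of test performance the table reports how many examinees fell there and what percentage later met the criterion of success.

```python
def expectancy_table(records, bands):
    """Tabulate percent successful within each score band.

    records: (test_score, succeeded) pairs, succeeded being True or False.
    bands:   (low, high) score limits, both inclusive.
    Returns a list of (low, high, number_of_cases, percent_successful).
    """
    table = []
    for low, high in bands:
        outcomes = [ok for score, ok in records if low <= score <= high]
        n = len(outcomes)
        percent = round(100 * sum(outcomes) / n) if n else None
        table.append((low, high, n, percent))
    return table


def success_at_or_above(records, cutoff):
    """Number of cases at or above a cut-off score and percent successful."""
    outcomes = [ok for score, ok in records if score >= cutoff]
    n = len(outcomes)
    return n, (round(100 * sum(outcomes) / n) if n else None)


# Hypothetical test scores paired with a later success/failure rating.
records = [
    (12, False), (15, False), (18, True), (22, False), (25, True),
    (27, True), (31, True), (34, True), (36, False), (38, True),
]
for low, high, n, percent in expectancy_table(records, [(10, 19), (20, 29), (30, 39)]):
    print(f"scores {low}-{high}: {n} cases, {percent}% successful")
print("at or above 20:", success_at_or_above(records, 20))
```

Unlike a single correlation coefficient, such a table shows directly what proportion of persons at each score level succeeded, which is the kind of information needed when a critical score is being set for a particular occupation.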


Fig. 12.11. “Which gear will make the most
turns in a minute?” From the Bennett Test
of Mechanical Comprehension. The Psychological
Corporation. (By permission.)

Some of the tests of mechanical aptitude merit consideration when
vocational or educational guidance is the problem at hand; for they
are valuable as supplements to other types of information. For example,
the Bennett and Fry Test of Mechanical Comprehension appears
to be quite useful for selection in situati[...]
machines is necessary. In [...]
desirable to administer more than [...] aptitude, since
[...] intercorrelations of the several [...] not nearly high
enough to warrant their use inter[...] particular combination
of tests used in [...] depend upon the
nature of the problem [...] concerned and
upon the kinds of jobs under consideration.

[...] the latter in that they contain materials which are of
significance for clerical occupations.


Tests of Clerical Aptitude 327 


The Psychological Corporation General Clerical Test (1950) is 
intended to examine routine clerical aptitude, proficiency in arith- 
metic, and verbal facility. The first is measured in terms of perception 
of similarities and differences between paired names and numbers, 
and in terms of skill in using a filing scheme. The second includes the 
usual arithmetic processes and problems. The third is tested through 
spelling, paragraph meaning, comprehending directions, word mean- 
ing, and language usage. 


1. 937-937 ( )
2. obh-obh ( )
3. Curtis & Co. - Curtis&Co. ( )
4. [...]

57. 6819002341-6839002341 ( )
58. vrxoaediqf-vrxoaediqf ( )
59. H. W. Hieronymous - H. W. Hiernymous ( )
60. [...]

Fig. 12.12. Rate and Accuracy in Perceiving Similarities and Differences.
From Detroit Clerical Aptitudes Examination. Public School Publishing
Company. (By permission.)


The Detroit Clerical Aptitudes Examination (1944)¹⁷ attempts to
measure more comprehensively than the foregoing, although the superiority
of the Detroit test has not been demonstrated. The authors
state that their test is intended to select pupils with capacities for
commercial courses in high school. The following parts are included:
rate and quality of handwriting, rate and accuracy in checking (perceiving
similarities and differences between paired series of digits, letters,
names, and small, simple geometric figures), simple arithmetic
processes, motor speed and accuracy (placing X’s in a page full of
small circles), knowledge of simple commercial terms, visual imagery
(disarranged parts of a picture), rate and accuracy of classification
(a letter-number substitution test, a variant of the digit-symbol test),
and alphabetical filing.

¹⁷ By H. J. Baker and P. H. Voelker, Bloomington, Ill.: Public School Publishing Co.

The Minnesota Clerical Test (1946) is one of the most widely
used in this area of aptitude testing. It consists of two subtests:
Numbers and Names. The former consists of paired numbers, only some
of which are identical and are to be marked by the testee, for
example: 9632-9632; 5179-5719. The Names test is designed to measure
the same functions (perception of detail and perceptual speed), using
identical and non-identical paired names. The differences within the
latter pairs are minor: for example, Braun and Co.-Brown and Co.
These two subtests have much in common regarding psychological
functions (r = .65); but the Names test is more inclusive and
complex, for it correlates considerably higher with intelligence
scales than does the Numbers test.
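
The item type just described can be sketched in a few lines of Python. This is only an illustration: the function name, the examinee's responses, and the net score (rights minus wrongs) are assumptions made for the example, not the publisher's scoring key.

```python
def score_comparison_items(pairs, checked):
    """Score a set of paired-comparison items.

    pairs   -- list of (left, right) strings shown to the examinee
    checked -- list of booleans, True where the examinee marked the pair
               as identical
    Returns (rights, wrongs, net); net = rights - wrongs is an assumed
    scoring rule used only for illustration.
    """
    if len(pairs) != len(checked):
        raise ValueError("pairs and checked must have the same length")
    rights = wrongs = 0
    for (left, right), was_checked in zip(pairs, checked):
        identical = left == right
        if was_checked == identical:
            rights += 1
        else:
            wrongs += 1
    return rights, wrongs, rights - wrongs


pairs = [
    ("9632", "9632"),                    # identical numbers
    ("5179", "5719"),                    # transposed digits
    ("Braun and Co.", "Brown and Co."),  # minor difference in a name
]
# Hypothetical examinee: checks the first pair (correct), leaves the
# second unchecked (correct), and checks the third by mistake.
checked = [True, False, True]
print(score_comparison_items(pairs, checked))
```

The sketch makes plain why such items load on perception of detail and speed: each response is a simple same-or-different decision, and the score depends on how many such decisions are made correctly in the time allowed.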


Evaluation of Clerical Tests. The three tests, briefly described above,
will suffice as examples of this type; for they are quite
representative. On the whole, results obtained with clerical aptitude
tests, if critically interpreted, may contribute to a better
understanding of a pupil’s capacities and to his guidance in the
selection of a high-school course, although the tests correlate only
moderately with marks in commercial courses (from about .30 to about
.50). But moderate correlations between test results and school marks
are the rule rather than the exception, for the size of the
coefficients is affected by several factors besides inadequacies of the
tests themselves: namely, pupils’ lack of interest or incentive in
their courses; interference of non-intellectual factors, such as poor
health, economic pressure, emotional forces, and extracurricular
activities; and the variability of teachers’ marks.

While their reliability coefficients are generally within the
satisfactory range, tests of clerical aptitude [...] quality of
performance on the [...]ally fall between about .20 [...] than
correlation coefficients. Authors of these tests provide, among other
data, norms for a variety of groups (e.g., high-school seniors in
commercial courses, office workers, non-office workers, different
classes of clerical workers, employed and unemployed office workers),
which indicate the trend in scores that would be expected for each of
several groups. But the range of scores within groups and the extent of
overlapping of scores between groups are so considerable, and
correlations are low enough, that in any individual case detailed
analysis of test results must be evaluated, not in isolation, but
together with other information concerning the individual under study.
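
To see why correlations of about .30 to .50 call for caution in the individual case, it may help to convert them into two standard indexes: the proportion of variance shared by test and criterion (the square of r) and the index of forecasting efficiency, 1 - sqrt(1 - r^2). The following is a minimal sketch; the function names are illustrative and the two values of r are simply the bounds quoted above.

```python
import math


def shared_variance(r):
    """Proportion of criterion variance associated with the test (r squared)."""
    return r * r


def forecasting_efficiency(r):
    """Index of forecasting efficiency: 1 minus the coefficient of alienation."""
    return 1.0 - math.sqrt(1.0 - r * r)


for r in (0.30, 0.50):
    print(
        f"r = {r:.2f}: shared variance = {shared_variance(r):.2f}, "
        f"forecasting efficiency = {forecasting_efficiency(r):.3f}"
    )
```

Even at r = .50 only a quarter of the variance in marks is associated with the test score, which is consistent with the considerable overlapping of scores between groups noted above.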


DIFFERENTIAL APTITUDE TESTS * 

Rationale. This battery of tests is a most ambitious and com- 
prehensive instrument. Its statistical data on standardization and anal- 
ysis—reliability, validity, intercorrelations of parts, norms, population 
samples—are exceptionally thorough. 

Several guiding principles were applied in the development of the
parts of this battery:

All eight parts of the battery were standardized on the same
population. Thus the norms and percentile values for each test have the
same relative significance as those for all the other tests in the
battery; for the ranges of age, aptitude, school grade, and
non-intellective personality factors were constant in the
standardization process. Psychological profiles, therefore, are more
meaningful for interpretation of differences within an individual. The
published norms are based upon a population sample of 47,000 boys and
girls in grades 8 through 12, from communities throughout the country.
Separate batteries and norms are available for boys and girls.

Each test in the battery should be relatively independent and measure
a relatively restricted range of aptitude.

Each test in the battery should be intended for the same purpose
as every other one: namely, educational and vocational guidance
(rather than, for example, selection for a particular job).

Each test in the battery should be useful for guidance in a number
of related areas, rather than in only one or two.

Each test should measure level of aptitude, that is, “power” rather
than speed of performance.

The battery of tests should yield a profile in percentile ranks. All
eight percentile ranks for an individual will be comparable since they
have been derived from the same population sample.
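
The idea of a percentile rank derived from a single norm group can be sketched as follows. The norm scores are invented for illustration, and the convention used here (those scoring below, plus half of those scoring exactly at the raw score) is one common definition, assumed for the example rather than taken from the DAT manual.

```python
def percentile_rank(score, norm_scores):
    """Percentile rank of a raw score within a norm sample.

    Counts the norm-group members scoring below the given score plus half
    of those scoring exactly at it, as a percentage of the whole group.
    """
    if not norm_scores:
        raise ValueError("norm sample must not be empty")
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


# Hypothetical norm sample of raw scores on one test of a battery.
norm_sample = [10, 12, 15, 15, 18, 20, 22, 25, 27, 30]

print(percentile_rank(15, norm_sample))
print(percentile_rank(27, norm_sample))
```

Because every test in such a battery is referred to the same norm sample, two percentile ranks obtained this way for one pupil describe his standing in the same population and can be compared directly.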


* By G. K. Bennett, H. G. Seashore, and A. G. Wesman. The Psychological
Corporation, 1947. Manual, Second Edition, 1952.


Contents. The battery includes the following eight tests.

Verbal reasoning: verbal analogies of the familiar type are used to
measure ability with more or less complex verbal concepts and
relationships. Since this test encompasses only a very restricted area
of verbal reasoning, “verbal analogies” would be a more appropriate
name.

Numerical ability: numerical relationships and facility with number
concepts are tested, that is, essentially computation rather than
arithmetical reasoning (solving problems). Some of the items test only
skill in the four fundamental processes; others require understanding
of numerical concepts and relationships.

Abstract reasoning: ability to reason wi[...] series presented in each
problem requires insight into an operating principle in the changing
diagrams. (See Fig. 12.13.)

Space relations: ability in spatial visualization is tested by
presenting two-dimensional geometric figures (variously shaded) which
are imaginally manipulated, each to form a three-dimensional figure.
Using two-dimensional drawings, the purpose is to test aptitude to
visualize constructed figures, variously rotated.

Mechanical reasoning: mechanical comprehension is tested with the
same types of items as those already described under Tests of
Mechanical Comprehension.

Clerical Speed and Accuracy: speed and accuracy of responses to
letter and number combinations are measured. This test involves the
matching of various combinations, emphasizing perception of detail
and rate of response. (See Fig. 12.14.)

Language usage: Part I is simply a spelling test. Some words are
correctly spelled; others are incorrectly spelled. The subject
indicates for each word whether it is right or wrong. To call this test
“language usage” is to stretch the meaning of that term to unreasonable
lengths.

Language usage: Part II is made up of sentences in which the examinee
is required to distinguish faulty from correct grammar, punctuation,
and word usage.

Each row consists of four figures called Problem Figures and five
Answer Figures. The four Problem Figures make a series. You are to find
out which one of the Answer Figures would be the next, or the fifth one
in the series.

[Figure panels: PROBLEM FIGURES; ANSWER FIGURES]

Study the position of the black dot. [...] around the square
clockwise: upper left corner, upper right corner, lower right corner,
lower left corner. In what position [...] next? It will come back to
the upper left corner. Therefore, B is the answer, and you would mark
your Answer Sheet like this: A B C D E

FIG. 12.13. Sample problem (“Abstract Reasoning”) from Differential
Aptitude Tests. The Psychological Corporation. (By permission.)

[Figure panels: TEST ITEMS; SAMPLE OF ANSWER SHEET]

FIG. 12.14. Number and letter test items from Differential Aptitude
Tests. In each row of the Sample Answer Sheet, the examinee marks the
combination that matches the underlined combination in the
corresponding left-hand row. The Psychological Corporation. (By
permission.)

Evaluation. As already stated, these are among the few most thoroughly
standardized and reported aptitude tests available. They measure
several psychological functions that have been shown to be significant
in educational and vocational guidance. In this connection, the
intercorrelations of the eight parts show that each has a separate and
different (though not entirely unique) contribution to make; for the
coefficients range (for boys) from .06 (Mechanical Reasoning with
Clerical Speed and Accuracy) to .62 (Sentences with Verbal Reasoning).
The range of coefficients for girls is approximately the same
(.12-.67).

Extensive correlational studies with other standardized tests, of a
variety of types, show that the several parts of the DAT do not measure
entirely the same functions as do these others, although some of the
parts (Verbal Reasoning and Numerical Ability) are often highly
correlated with tests of general intelligence. The DAT, therefore, has
a contribution to make in a comprehensive study of an individual’s
aptitudes.

Extensive data are also reported showing validity coefficients between
DAT scores and grades in a variety of school subjects. These indexes
are of approximately the same magnitude as those found with other sound
tests. The correlations on the parts of the DAT vary with the several
school subjects used as validity criteria; so that, as is to be
expected, some parts have greater predictive value in a given area of
studies than do others.

In addition to the foregoing, norms of performance are provided
for a number of different educational and occupational groups.

The tests are adequately reliable for the most part, the mean
coefficients for the several grade levels varying from .85 to .93 for
boys, and .71 to .92 for girls. The test of Mechanical Reasoning
(reliability = .71) is the only part that is seriously below accepted
degrees of reliability.

The scores of this battery are reported in profile (graph) form, as
shown in Figure 12.15. The interpretation of such a profile and the
significance to be attached to score differences as shown on the graph
depend upon adequate comprehension of the statistical evidence and
the psychological rationale presented in the Manual, and both of
these will make more than the usual demands upon the non-specialist’s
technical sophistication.

[Figure: profile chart giving raw score and percentile for each test,
plotted against percentile and standard-score scales]

FIG. 12.15. Differential Aptitude Tests. A profile of scores. The
Psychological Corporation. (By permission.)

As is the case with all tests of aptitudes, the ultimate value of the
DAT will depend upon follow-up studies showing their predictive
efficiency in learning and job performance. For this purpose, the
authors and users of this battery of tests will have to determine the
differential and predictive significance of profile “patterns” and
multiple correlations (between two or more parts of the tests, on the
one hand, and the criterion, on the other), thus going beyond the
simple correlation coefficients and expectancy tables, important
though these are.


APTITUDE CLASSIFICATION TESTS 19


This battery of tests has been devised for purposes of vocational
guidance. Its recommended occupational range (thirty vocations) covers
a wide area at several levels: including such different occupations as
office clerk, mechanic, humanities professor, nurse, physicist, and
writer.

19 By J. C. Flanagan, Science Research Associates, 1953. This test is
described at this point since it is intended for use in vocational
guidance, covering a range [...] and mechanical through some
professions. It may, therefore, serve as a bridge to the following
chapter.

The battery comprises fourteen tests, each of which is briefly de- 
scribed below: 


inspection: ability to spot flaws 

coding: speed and accuracy in coding typical office information 

memory: recall of codes learned in coding test 

precision: speed and accuracy in making small circular finger 
movements 

assembly: ability to visualize the appearance of an object from a 
number of separate parts 

scales: speed and accuracy in reading scales, graphs, and charts 

coordination: ability to coordinate hand and arm movements 


judgment and comprehension: ability to read, understand, reason, 
and use good judgment 


arithmetic: the four fundamental processes 
patterns: ability to reproduce simple pattern outlines 


components: ability to identify important component parts in line 
drawings and blueprint sketches 


tables: reading two types of tables—one using numbers, the other 
using words and letters 


mechanics: understanding mechanical principles 
expression: communicating ideas in writing and talking 


This device is of the differential aptitude type. Its author suggests
various combinations of the fourteen tests as having prognostic value
in the selection of a vocation. For example, for the prospective
occupation of nurse, the three recommended tests are (1) memory, (2)
scales, (3) judgment and comprehension; for the prospective humanities
professor the recommended tests are (1) memory, (2) judgment and
comprehension, (3) expression. So far as this battery is able to
differentiate between aptitudes, then, the distinction between these
two widely diverse professions appears to depend upon the tests of
expression and of scales. It is hardly probable, [...] differences in
aptitudes, in this instance [...] these sets of tests would indicate.
[...] between many pairs of the thirty occupations for which test
combinations are recommended.
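
The comparison made in this paragraph amounts to taking the tests common to two recommended sets and the tests found in only one of them. A minimal sketch, using only the two occupations cited above (the dictionary and function names are illustrative, not part of the battery's materials):

```python
# Recommended test combinations for the two occupations cited in the text.
recommended = {
    "nurse": {"memory", "scales", "judgment and comprehension"},
    "humanities professor": {
        "memory",
        "judgment and comprehension",
        "expression",
    },
}


def compare_occupations(first, second, table=recommended):
    """Return (shared tests, tests that distinguish the two occupations)."""
    a, b = table[first], table[second]
    return a & b, a ^ b  # intersection, symmetric difference


shared, distinguishing = compare_occupations("nurse", "humanities professor")
print("shared:", sorted(shared))
print("distinguishing:", sorted(distinguishing))
```

Two of the three tests are shared, so the whole weight of the distinction between the two professions falls on the remaining pair, expression and scales.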

So far as statistical analysis of standardization data is concerned,
the results are fairly satisfactory. Intercorrelations between
individual tests are generally low: seven of ninety-one coefficients
are .50 or higher; the median is .29. Standard errors of measurement
are moderate. Reliability coefficients for individual tests are fair
(.26 to .86, Form A; .55 to .85, Form B). However, reliability
coefficients for combined scores of all tests recommended for a
particular occupation are much higher for nine occupations reported:
namely, from .83 to .93. These reliability data make it clear that the
“job element approach,” upon which this battery is based, yields much
less stable results than does a more nearly “wholistic” or “global”
approach.
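
Two points in the preceding paragraph can be checked with a little arithmetic. Fourteen tests yield 14 x 13 / 2 = 91 distinct pairs, which accounts for the ninety-one coefficients. And a combined score is ordinarily more reliable than its parts; the sketch below uses the standard formula for the reliability of an unweighted sum of standardized tests. The three part reliabilities are invented for illustration (only the median intercorrelation of .29 comes from the text), so the result is not a figure reported for this battery.

```python
import math


def composite_reliability(reliabilities, mean_intercorrelation):
    """Reliability of an unweighted sum of standardized tests.

    Error variance of the sum is the sum of (1 - r_ii); total variance is
    k + k(k - 1) * mean intercorrelation, each test having unit variance.
    """
    k = len(reliabilities)
    error_variance = k - sum(reliabilities)
    total_variance = k + k * (k - 1) * mean_intercorrelation
    return 1.0 - error_variance / total_variance


# Number of distinct pairs among fourteen tests.
print(math.comb(14, 2))

# Hypothetical part reliabilities; .29 is the median intercorrelation
# quoted in the text.
print(round(composite_reliability([0.60, 0.65, 0.70], 0.29), 3))
```

With these assumed values the three-test composite is more reliable than any one of its parts, which is the direction of the difference reported above between single tests and occupational combinations.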

So far as predictive (on-the-job) validity is concerned, the value
of these tests remains to be demonstrated. The standardization
population consisted of high-school seniors. Only a few of these have
been followed up and rated in their occupations. The correlations
between vocational test scores and occupational progress after four
years (rate of salary increase), in seven occupations, ranged from .29
to .64. Correlations of test scores (obtained in senior year of high
school) with college grades were from .24 to .36.

It is clear that in their present state, the Aptitude Classification
Tests must be regarded as tentative efforts in the direction of
vocational guidance by means of tests based on the “job element
approach.”


13.


APTITUDE TESTS: 
FINE ARTS AND PROFESSIONS 


TESTS OF MUSICAL APTITUDE 


The Seashore Tests. The most widely known are the Seashore Measures
of Musical Talent (1919-[...] recorded on phonograph records, [...]
discrimination of timbre, or tone quality; (5) rhythm discrimination
(pairs of rhythmic patterns of increasing complexity, to be discerned
as the same or different); and (6) tonal [...] six parts do not
represent an individu[...] profiles of the six auditory capacities
[...] should be based upon them.

Seashore approached t[...] of view different from that of [...] a
theoretical [...] into its sensory components. Some of these
components, he held, can be measured objectively, whereas others
cannot. The six capacities mentioned above are among the measurable
ones; but as Seashore himself insisted, they do not offer a complete
index or profile of all components of musical aptitude. They are
measures only of auditory perception required in music, analyzed into
six components.

For the purpose of validation, results of the Seashore measures
have been compared with teachers’ ratings of musical ability, with
musical achievement, and with quality of work in schools of music.
The obtained correlations have not been such as to warrant the
conclusion that these tests of auditory perception are sufficiently
valid in predicting various levels of musical talent. Seashore himself,
however, has objected to attempts at an over-all validation of his
measures. While he has not done so himself, he maintains that each of
the six tests should be separately validated against different kinds of
specialized musical activity. For example, the test of pitch
discrimination should be validated especially for players of string
instruments.

At any rate, this much may be said: the Seashore measures reveal
those persons whose auditory perceptions are so deficient that they
could not successfully participate in the formal study or performance
of music.

1 RCA Manufacturing Co., Camden, New Jersey.


The Wing Tests. The Wing Standardized Tests of Musical Intelligence
(1948), also presented by means of a series of recordings, are
intended to meet at least two objections that frequently have been
directed against the Seashore tests: namely, that the latter are
“atomistic” and that they are not based upon functions being trained
and considered most important by teachers of music. The Wing tests
measure seven aspects of musical perceptiveness, yielding a score for
each; but test performance is represented by only a single total score,
rather than by part-scores, since “musical intelligence” is held to be
a unitary, though complex, aptitude.

The seven parts are: (1) chord analysis, (2) pitch change, (3)
memory, (4) rhythmic accent, (5) harmony, (6) intensity, (7)
phrasing. These aspects of musical capacity are held to be more
closely associated with actual teaching of music than are Seashore’s,
since Wing’s standardized tests are based upon the individually
constructed and subjective types of tests employed by teachers and
examiners in music. The Wing tests, therefore, are considered by many
teachers of music to be more relevant to the selection and training of
persons in that art.

2 See J. G. Saetveit, D. Lewis, and C. E. Seashore, Revision of the
Seashore Measures of Musical Talents, Iowa City: University of Iowa,
1940; H. M. Stanton, Measurement of Musical Talent, University of
Iowa, 1935; C. E. Seashore, Psychology of Music, New York: McGraw-Hill
Book Co., 1938; J. L. Mursell, The Psychology of Music, New York:
W. W. Norton, 1937; Paul R. Farnsworth, “An Historical, Critical, and
Experimental Study of the Seashore-Kwalwasser Test Battery,” Genetic
Psychology Monographs, Vol. 9, 1931, pp. 291-393.

3 By H. D. Wing and C. Wing. Sheffield City Training College,
Sheffield, England. Cf. Fourth Mental Measurement Yearbook, pp. 344 ff.
Use of the term “musical intelligence” is unfortunate, since the word
“intelligence” has been given another psychological connotation.


The Drake Test. The Drake Musical Memory Test (1934) rests
upon the premise that immediate memory of melodic materials is a
basic factor in musical ability.4 It consists of a series of paired
melodic items, the members of each pair differing in key, or time, or
notes.


The Madison Test. The Interval Discrimination Test (1942), by
Madison, is, like the Seashore tests, an example of a measure based
upon a theoretical assumption.5 The premise in this instance is that
degree of ability to make interval discriminations is a measure of
musical aptitude. The intervals vary from those easily perceived to
those requiring very fine auditory discrimination.


Evaluation of Music Tests. The foregoing tests of musical aptitude
are reasonably reliable measures of the functions they measure, when
the total scores are considered. On the Seashore, for example,
reliabilities of the part-scores range from only .62 to .88 (Series A).
This means that the part-scores often are not reliable enough to
discriminate, except grossly, between the several sensory functions
within each person. Undoubtedly, the total Seashore score is more
reliable; but since the Seashore theory posited separate factors rather
than a unitary function, total scores are not reported and total score
reliabilities are not given.

4 R. M. Drake, “Four New Tests of Musical Talent,” Journal of Applied
Psychology, Vol. 17, 1933, pp. 136-147; “The Validity and Reliability
of Tests of Musical Talent,” ibid., pp. 447-458. Test materials are
obtained from Public School Publishing Co., Bloomington, Ill.

5 T. Madison, Interval Discrimination as a Measure of Musical Talent,
Archives of Psychology, No. 268, 1942. The test is obtainable from its
author, University of Indiana, Bloomington, Indiana.
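
Why part-scores of modest reliability discriminate only grossly within a person can be illustrated with the usual formula for the reliability of a difference between two standardized scores. In the sketch below the two reliabilities are the extremes quoted above for the Seashore part-scores; the intercorrelation of .40 between the two parts is an assumed value chosen only for illustration.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference between two standardized scores.

    r_xx, r_yy -- reliabilities of the two part-scores
    r_xy       -- correlation between the two part-scores
    """
    if r_xy >= 1.0:
        raise ValueError("the parts must not be perfectly correlated")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Reliabilities .62 and .88 are the extremes quoted in the text; the
# intercorrelation of .40 is hypothetical.
print(round(difference_reliability(0.62, 0.88, 0.40), 3))
```

Under these assumptions the difference between the two part-scores is less reliable than either part taken alone, which is the difficulty in interpreting a profile of sensory capacities within one individual.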

The Wing tests, however, are based upon the theory of a unitary
function; their total score reliabilities, therefore, are reported. The
coefficients of reliability, using test-retest total scores, average
.95 for an interval of one year, and .86 for intervals varying from two
to four years. These coefficients fall within the range of satisfactory
reliabilities.

It is admittedly very difficult to validate a test of musical aptitude
against a criterion of achievement, since grades in music are
fragmentary or nonexistent, grading is exceedingly subjective, training
opportunities are markedly unequal, and interest and motivation are
highly variable from person to person. Thus it is that even after the
years the Seashore tests have been in use, many psychologists believe
their predictive value has not been sufficiently demonstrated.
Available Seashore validity coefficients vary from low to moderate.

The Wing tests present weightier evidence of validity, including a
study of above-average, average, and below-average students of
musical instruments (ages 14-16; N = 333 boys). Forty percent of
the lowest group had discontinued their musical studies, as compared
with 27 percent of the average group and 2 percent of the highest
group. A finer classification would have defined the failures even
more precisely.

On the whole, it appears that in spite of uncertain over-all validity,
these tests are helpful in identifying persons whose prospects are
poor and those whose prospects are high for profiting from instruction
in music, so far as these essential psychological functions are
involved.

In spite of the differences in hypotheses and scoring between the
Seashore and the Wing tests, it appears from factorial analyses that
the two instruments are measuring the same aptitudes to a marked
degree, when the parts of each are taken in combination as Wing has
done. It appears that there is a “general factor” operating in the
several tests, accounting for 30 to 40 percent of the variance found
among individuals. This general factor is identified as essentially a
complex perception of the sensory aspects of music. Since this appears
to be the case, it would be desirable to measure these basic aspects
of musical aptitude by means of the more complex processes such as are
sampled by the Wing tests.6 This method of testing is indicated [...]
in view of the fact that teachers of music and others in that [...]
maintain that the separate sensory functions (as measured in the
Seashore) do not operate in music as they do in the controlled and
isolated test situation.
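
The statement that a general factor accounts for 30 to 40 percent of the variance can be made concrete. For standardized tests, the share of total variance attributable to one factor is the mean of the squared loadings of the tests on that factor. The loadings below are invented for illustration and are not taken from the factorial studies cited.

```python
def general_factor_share(loadings):
    """Proportion of total variance due to one factor.

    Assumes standardized tests (unit variance each), so the share is the
    mean of the squared factor loadings.
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    return sum(g * g for g in loadings) / len(loadings)


# Hypothetical general-factor loadings for seven subtests.
loadings = [0.65, 0.60, 0.55, 0.60, 0.50, 0.62, 0.58]
share = general_factor_share(loadings)
print(f"{100 * share:.1f} percent of the variance")
```

Loadings of roughly .55 to .63 on every subtest are enough to give a share within the range reported, which suggests how substantial the common element must be.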


TESTS OF APTITUDE IN THE GRAPHIC ARTS 


The Knauber Test. The Knauber Art Ability Test (1932-35), for use in
grades 7 to 16, requires the following: drawing a design from memory;
drawing from memory figures within space limitations; drawing a
stereotyped character such as Santa Claus; arranging within a given
space a specified composition; creating and completing designs from
supplied elements; spotting errors in drawn compositions, such as
incorrect perspective, misplaced details, incorrectly proportioned
details, incongruous or inconsistent elements; production of
compositions intended to show creative imagination, ingenuity, ability
to represent a concept symbolically, or to plan and execute a
universal idea.

[...] on some parts, it ap[...] aditional prob[...] in art, quality of
observati[...] creative imagination. That is, the test would be in
large part a measure of learning rather than [...]ry usefulness is with
individuals who have not had the benefits of some formal instruction.


The Meier Test. The Meier Art Judgment Test (1940),8 for grades
7 to 12, consists of 100 pairs of uncolored pictures, one member of
each pair being a reproduction of a masterpiece. The pictures cover
a wide variety of subjects: landscape, still life, woodcuts, oriental
drawing, murals, and others. The second member of the pair is altered
from the original in some respect and in such a manner as to make
it inferior to the original. The nature of the alteration in each pair
is indicated (perspective, use of curves, arrangement, etc.); the
subject is, of course, not told which one is the original and which is
the altered copy. It is his task to indicate his preference as between
the two copies of each pair. The author of this test has selected not
artistic execution [...] rather, aesthetic judgment as the principal
factor in art aptitude and [...] the only factor to be tested, the
contention being that [...] a “key-capacity,” the most trustworthy and
significant [...] in art and to success in a career in art. The
soundness [...] however, remains to be validated. If it were correct
[...] artistic perception and judgment constitute the “key-capacity,”
rather than actual execution and creativity, we should then expect art
critics and writers on art to be the most talented in actual
production; but such is not the case.

PROBLEM 3

This is a test of accuracy and observation.

Score 10
These drawings are good in proportion and quality of line. Details are
accurately observed.

Score 6
The proportions in these drawings are not as good, and the lines are
weaker.

Score 3
These drawings are poor in proportion and in execution. They indicate
a lack of careful observation.

FIG. 13.1. The subject is shown a design in the examination booklet
which he is asked to copy. This item is a test of accuracy and
quality of reproduction. From the Knauber Art Ability Test. (By
permission.)

FIG. 13.2. One of these pictures represents an artistic work of
established merit. The other is an adaptation of that work, and
aesthetically inferior. In this pair, the subject is required to select
the original and aesthetically superior work on the basis of the shapes
of the bowls. From the Meier Art Judgment Test. Bureau of Educational
Research and Service, State University of Iowa. (By permission.)

6 See J. McLeish, “The Validation of Seashore’s Measures of Musical
Talent by Factorial Methods,” British Journal of Psychology,
Statistical Section, Vol. 3, 1950, pp. 129-140. Also, H. D. Wing, “A
Factorial Study of Musical Tests,” British Journal of Psychology,
Vol. 31, 1941, pp. 341-355.

7 Psychological Corporation, New York.

8 Bureau of Educational Research and Service, State University of Iowa,
Iowa City.


The Graves Test. The Graves Design Judgment [...] devised to measure
[...] enumerates and describes eight [...] proportion, and rhythm.
[...] extent to which an individual [...] these principles.

9 Psychological Corporation, New York.



Each of the 90 items consists of either two or three plates from
which the testee selects the one he regards as superior to the other
one or two. The pictures in each item are two-dimensional or
three-dimensional designs (“non-representational”), only one of which
incorporates the principles of design upon which the test was devised.
The designs are such as to eliminate or minimize subject matter and
personal connotations and feelings, while emphasizing relatively pure
aesthetic choice.

All eight principles are incorporated within one design in each
set (item), whereas one or more principles are violated in the other
designs of each set. Thus this test attempts to evaluate a unitary
complex process (aesthetic judgment) rather than making separate
analyses of elements that enter into aesthetic judgment. In this respect
the Graves test is consistent with the Wing test in music and with the
principle of a general factor in intelligence testing.

It is important to note the criteria upon the basis of which the items
were selected in this complex and highly subjective fine art. Retained
items in the final test satisfied all three of these criteria: (1)
agreement among teachers of art as to the superior design in each set
of plates; (2) greater preference for a design by students of art than
by non-art students; and (3) greater preference for a design by
high-scoring subjects than by low-scoring subjects. These are the
criteria generally employed in this field of test construction.
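
The three criteria for retaining an item can be expressed as a simple rule. The sketch below is an illustration only: the field names, the sample proportions, and the 80 percent threshold for agreement among teachers are assumptions, since the text does not state how much agreement was required.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    """Proportions choosing the keyed (superior) design for one item."""
    teacher_agreement: float  # among teachers of art
    art_students: float       # among students of art
    non_art_students: float   # among non-art students
    high_scorers: float       # among high-scoring subjects
    low_scorers: float        # among low-scoring subjects


def retain(item, min_teacher_agreement=0.80):
    """Keep an item only if it satisfies all three criteria.

    The 0.80 threshold for teacher agreement is a hypothetical value.
    """
    return (
        item.teacher_agreement >= min_teacher_agreement
        and item.art_students > item.non_art_students
        and item.high_scorers > item.low_scorers
    )


# Hypothetical item statistics.
good_item = ItemStats(0.92, 0.78, 0.55, 0.85, 0.48)
weak_item = ItemStats(0.90, 0.60, 0.62, 0.70, 0.50)  # fails criterion (2)

print(retain(good_item), retain(weak_item))
```

Criterion (3) is an internal-consistency check, whereas (1) and (2) appeal to outside judgment and to known groups; an item must pass all three to be kept.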


Evaluation of Art Tests. The authors of tests of aptitudes in the
visual arts are not in complete agreement regarding the capacities to
be measured. Nor does it appear that any one of the sounder tests,
such as those described above, is markedly superior to the others in
predictive value. The tests are of two general types: production of
specified art forms (performance), as in the Knauber, and aesthetic
judgment, as in the others. Since aptitude for aesthetic judgment is not
equivalent to aptitude for art production, a combination of both types
should have greater selective and predictive value than either one
alone for educational and vocational purposes.

It is not surprising that tests in this category differ in content, for
experts in this highly subjective field of art are not in close
agreement regarding the basic criteria, or principles, to be employed
in the selec[...] that is, regarding aesthetic judg[...]ily difficult,
therefore, to establish uni[...]l criteria for the selection and
scoring of test items and for the validation of test scores, once the
items have been selected.

[...]ined to the foregoing problem is the inevitable result [...]
different teachers and critics of art, [...] their evaluations, will
rate art produ[...] that there are relatively few useful [...]cational
activities. [...] only fair to low, when correlated with [...]
individuals of unusual capaci[...] in evaluating appreciation of some
of the graphic arts. The[...] differentiate between art [...] non-art
students, the differences between their mean scor[...]ive or even only
reproductive performance.

Finally, the available tests of [...] apparently, that this capacity is
[...]ations of pictures and of relati[...]


TESTS OF APTITUDE IN MEDICINE


Several tests of [...] first one has appe[...]

10 See I. L. Kandel, Professional Aptitude Tests in Medicine, Law, and
Engineering, New York: Teachers College, Columbia University, 1940.



Recent editions have included all or most of the following subtests: 
visual memory, memory for content, scientific vocabulary, understand- 
ing of printed material, scientific definitions, and logical reasoning. 

These tests have been based upon an analysis of the qualifications
necessary for the successful study of medicine, which are given as
the following. First, it is essential to have sufficient mental alertness
to learn quickly and to organize the material learned so that it can be
retained and utilized in later work. A sampling of medical materials
was used in the test to examine this capacity. Second, past scholastic
performance may be expected to indicate future learning. Inasmuch
as all premedical students have had elementary courses in chemistry,
physics, biology, and English, sections of the test are devoted to
questions in these subjects to determine the extent of the candidates’
learning in them. Third, the capacity to make correct interpretations
and deductions from given data was considered essential. Hence, the
test included a passage of difficult reading, dealing with materials
found in medical studies. The testee is required to make certain
interpretations and deductions based upon the passage, to which he
may refer at any time. Fourth, since medical students and practicing
physicians are expected to draw conclusions and make diagnoses
from given facts, a subtest was devised to evaluate “logical reasoning.”
This consists of a set of premises and conclusions drawn from them,
the student’s task being to determine whether or not the conclusions
are warranted.

The reader has surely noted that the mental capacities to be measured
by means of the foregoing types of items are by no means peculiar
to the study of medicine. They are, indeed, capacities which are
required in all fields of study and in all professions. The tests of this
medical committee, therefore, though they are called aptitude tests, are
actually tests of general ability, utilizing in part a special content
which is included in or closely associated with the materials of study
in medical schools. In other words, the form of mental activity being
tested is the same as in any other professional field, but the content
is in part specialized. It is in this sense that this test and others like it
are aptitude tests, as that term has been defined. It is important to
note this fact; for otherwise one might get the impression that a
specialized aptitude, independent of general ability, is required for the
study of medicine.




The Medical College Admissions Test (1946- ) has been developed and
administered by the Educational Testing Service¹¹ for the Association of
American Medical Colleges. The purpose of this aptitude test was stated
to be: to provide highly dependable measures of the advanced student's
general ability and of his achievement in a special field of study. The tests
are predicated upon the principle that a significant aspect of poten- 
tiality for a specialized field of study at the graduate and pre-profes- 
sional level may be measured by testing the candidate’s general 
scholastic ability and his achievement in a special field which is pre- 
requisite to advanced study in the same ora 
In addition, a test of “ 
included in the battery. 
political science, and soci 
to social issues rather than materials retained from college courses 
in these fields of study, 


To measure general ability, the battery has two major divisions:
verbal and quantitative. Verbal ability is measured by tests of
vocabulary (word opposites), sentence completion, and word analogies.
Specimens of these follow.

Opposites:
Harmony: (a) accord (b) affinity (c) oppression (d) conflict
(e) desecration

Sentence completion:
The manufacturing of small machinery is profitable for the nation
lacking mineral resources, inasmuch as ____ can be exploited in
place of ____.
(a) quantity-quality; (b) power-efficiency; (c) skills-quality;
(d) alloys-ores; (e) skills-materials

In addition to the foregoing, the test on current society and the
test on materials from college co
be regarded as contributing to an
verbal ability, although these tw.


11 Princeton, New Jersey. This test was formerly called the Professional
Aptitude Test.




Ability in quantitative thinking is measured with tests of arithmeti- 
cal reasoning, applied and abstract. 

The test of achievement in a specialized field of study is, naturally, 
a science test covering a wide sampling of concepts and problems from 
basic college courses in biology, chemistry, and physics. These items 
are intended to evaluate the candidate’s “grasp of fundamental prin- 
ciples of science.” 

The science test includes completions, classifications, analogies, 
quantitative comparisons, and paragraph comprehension. Taken as a 
whole, the science test is concerned with basic scientific concepts, and 
with applications and problems in the several fields. The problems are 
of increasing complexity requiring comprehension, interpretation, 
inference, and analysis of data—in general, the use of knowledge in 
dealing with multi-phase problems. Sample items and descriptions 
follow. 


COMPLETIONS 

Sample Directions: Each of the following incomplete statements is
followed by five suggested completions. Select the one completion
which is best in each case and indicate your selection in the
appropriate space on the answer sheet.


16. A sodium atom and a sodium ion 
(A) contain the same number of electrons 
(B) contain the same number of protons 
(C) have the same chemical properties 
(D) have the same physical properties 
(E) have different atomic numbers 


CLASSIFICATIONS 
Sample Directions: Each of the numbered words or phrases below is
associated with one, both, or neither of the headings listed as (A)
and (B) above it. On the appropriate line of the answer sheet blacken
the space under

A if the numbered word or phrase is associated with (A) only,
B if the numbered word or phrase is associated with (B) only,
C if the numbered word or phrase is associated with both (A)
    and (B),
D if the numbered word or phrase is associated with neither
    (A) nor (B).

Sample Questions:

(A) Thyroid gland
(B) Pituitary gland
(C) Both
(D) Neither

17. Giantism
18. Low blood calcium
19. Cretinism
20. Short stature

ANALOGIES 


Sample Directions: Each of the following questions consists of an in- 
complete analogy with five suggested completions. Select the one 
word or phrase which best completes the analogy and indicate your 
selection in the appropriate space on the answer sheet. 


Sample Questions:

21. ohm : resistance :: watt : ____
(A) electricity (B) work (C) power (D) current
(E) potential

22. atom : molecule :: element : ____
(A) electron (B) mixture (C) isomer (D) isotope
(E) compound

23. yolk : egg :: ____ : bean seed
(A) hypocotyl (B) epicotyl (C) cotyledon (D) testa
(E) endosperm


QUANTITATIVE COMPARISONS 


Sample Directions: The following paired statements describe two 
entities which are to be compared in a quantitative sense. On the 
appropriate line of the answer sheet blacken the space under 


A if A is greater than B, 

B if B is greater than A, 

C if the two are equal or very nearly equal. 
Sample Questions:

24. (A) The total resistance of two given resistances in series
    (B) The total resistance of the same two resistances in parallel

25. (A) The volume occupied by one gram-molecular weight of
        helium at standard conditions
    (B) The volume occupied by one gram-molecular weight of
        oxygen at standard conditions

26. (A) The concentration of oxygen in the right auricle of a
        mammalian heart
    (B) The concentration of oxygen in the left auricle of a
        mammalian heart


PARAGRAPH COMPREHENSION

Sample Directions: In this part of the test there are several passages,
each followed by a series of statements. Read the passage and then
classify each of the statements under one of the following categories:

(A) The statement is warranted by information given in the
    passage.
(B) The statement is true but not warranted by the passage.
(C) The statement is contradicted by the passage.
(D) The statement is contradicted by established evidence but
    not by the passage.

[Scientific passages then are given, each followed by several
statements to be marked as directed above.]

In a second part of paragraph comprehension, the passages are
followed by five questions, from which the best answer is to be
selected, based upon material in the paragraph.


Sample Passage: 

The only carbohydrate which the human body can absorb and
oxidize is the simple sugar glucose. Therefore, all carbohydrates
which are consumed must be changed to glucose by the body before
they can be used. There are specific enzymes in the mouth, the
stomach, and the small intestine which break down complex
carbohydrates. All the monosaccharides are changed to glucose by enzymes
secreted by the intestinal glands, and the glucose is absorbed by the
capillaries of the villi.

The following simple test is used to determine the presence of the
monosaccharides. If Benedict's solution is added to a solution
containing glucose or one of the other monosaccharides and the resulting
mixture is heated, a brick-red precipitate will be formed. This test
was carried out on several substances and the information in the
following table was obtained. “P” indicates that the precipitate was
formed and “N” indicates that no reaction was observed.

Material Tested              Observation
Crushed grapes in water
Cane sugar in water
Corn syrup
Molasses


Sample Questions on Passage:

31. From the results of the test made upon crushed grapes in water,
one may say that grapes contain


(A) glucose 

(B) sucrose 

(C) a monosaccharide 
(D) no sucrose 

(E) no glucose 




32. The carbohydrate content of which one of the following foods 
probably undergoes the LEAST change during the digestive 
process in the human body? 

(A) Cane sugar 
(B) Corn syrup 
(C) Molasses 
(D) Bread 

(E) Potato 


Evaluation of Medical Aptitude Tests. The purpose of these tests is
to provide an improved basis for predicting quality of performance in
medical studies, not in medical practice. In some instances, the tests
have shown better results than have premedical course marks; in
other instances the reverse has been true. But in all instances, Moss
found that the best criterion is a combination of test results and
premedical course marks. For example, he reports in one study the
following correlations: premedical course marks with medical school
averages, 0.67; test scores with medical school averages, 0.64; medical
school averages with a combination of test scores and premedical
marks, 0.81 (multiple correlation). These are very satisfactory
coefficients. Other correlational studies yielded coefficients that were
higher in some instances but much lower in others.

The variation in correlation coefficients between test scores and
medical school grades, found in different studies, is not attributable
solely, or perhaps even principally, to defects in the medical aptitude
tests. The differences among coefficients must also reflect serious
differences in medical school grading standards, inequalities of
undergraduate preparation (which to some extent can be compensated for
in medical-school courses), and personality traits which tend to
produce inconsistencies between promise and performance.

Expectancy statistics have also
test scores. For example, one stud
the highest-decile students failed i
percent of the lowest-decile stude

that the lowest decile group contri ailures—
that is, two and a half times its quota, i i
of students.
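
As an illustrative aside, the gain obtained by combining two predictors, such as the 0.81 multiple correlation reported above, can be checked with the standard two-predictor formula, R² = (r1² + r2² − 2·r1·r2·r12) / (1 − r12²). The sketch below rests on an assumption that should be stated plainly: the intercorrelation r12 between test scores and premedical marks is not given in the study cited, so the value 0.31 is hypothetical, chosen only because it yields a multiple correlation close to the reported figure. The function name is likewise invented for the example.

```python
import math


def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2: correlation of each predictor with the criterion.
    r12: correlation between the two predictors (|r12| < 1).
    """
    if not -1.0 < r12 < 1.0:
        raise ValueError("r12 must lie strictly between -1 and 1")
    r_squared = (r1**2 + r2**2 - 2.0 * r1 * r2 * r12) / (1.0 - r12**2)
    return math.sqrt(r_squared)


# Reported validities: premedical marks 0.67, test scores 0.64.
# HYPOTHETICAL intercorrelation of 0.31 (not given in the source).
combined = multiple_r(0.67, 0.64, 0.31)
print(round(combined, 2))
```

Under these assumptions, the more the two predictors overlap (the larger r12), the less the second predictor adds to the first; with r12 near 0.9 the combined coefficient would barely exceed 0.67.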




within the acceptable range, namely, from .89 to .94 for the several 
parts. These reliability data again show that it is not as difficult to 
establish reliability as it is to demonstrate validity. 

While these aptitude tests are not intended to predict effectiveness
in medical practice which, like other professions, is dependent upon
a complex of factors, the tests appear also to have some value in
forecasting medical students’ levels of success in internships. When a
group of interns were rated on a five-point scale by their hospital staffs,
the results showed that the tests have some selective value, especially
in identifying students who prove to be the most satisfactory interns.¹²

With study in professional schools, including medicine, now within
the resources of very large numbers of students, continued research
leading to the development of increasingly effective testing instruments
is essential. It would be desirable to have thorough studies made of
such factors as: effects of coaching and cramming upon the medical
aptitude test scores; relationships between test scores and ratings on
interest inventories; relationship of test scores to drop-outs in the
form of expectancy tables (that is, value of the tests in predicting
survival in the professional school; not in predicting grades alone);
correlations between test scores and interview ratings; role of
personality traits in medical school performance and survival (that is,
degree of emotional stability, degree and type of motivation,
introversion-extroversion, dominance-submission, kinds and strengths of
values, etc.). Admittedly, research on these and other personality traits
would be difficult and long; but they might, nonetheless, prove significant.


TEST OF APTITUDE IN LAW 


General Characteristics. Tests in this field are aptitude tests
in the same sense as are those in the field of medicine. Psychologists
concerned with the problem of testing aptitude for legal study agree
that the following abilities are most important: reading rapidly and
comprehending relatively difficult material, rapid memorizing and
accurate recall, reasoning by analogy, discriminating between the
relevant and the irrelevant in a mass of facts, reasoning inductively and
deductively, facility in using and acquiring a vocabulary. Legal
aptitude tests thus far constructed attempt to measure most or all of these
abilities in some degree.


12 Kandel, op. cit., p. 20.




The Iowa Legal Aptitude Test¹³ (1948) includes the following
seven parts: analogies, reasoning, opposites, relevancy, mixed
relations, memory, and information. (1) The first of these requires the
examinee to select from a group of six words the two which have to
each other the same relationship as two key words. (2) The second
part presents a series of stated premises as true; the examinee must
decide whether a given conclusion is necessarily true, probably true,
necessarily false, probably false, or indeterminate. (3) The third is
the usual test of word opposites, with the addition that the subject
must recognize specified parts of speech. (4) The fourth consists of
a legal case which is presented for study; the student must determine
whether each of the principles presented would be relevant to the
argument of the plaintiff, the defend.


on part one; approximately two 
five, he is required to answer mult 
facts, without referring back to th 
of true-false statements concernin 


(2) Data Interpretation; (3) Rea 
(4) Debates; (5) Best Arguments; (6) P. 
reading, to detect the word in each para 
graph’s meaning). 

Although the ent are named somewhat dif- 
ferently, they are igi 


Iowa and others: und 


Figure 13.3. 


13 By M. Adams, L. K. Tunks, and D. B. Stuit, Bureau of Educational
Research and Service, State University of Iowa, Iowa City.
14 Educational Testing Service, Princeton, N. J.




Directions: In each of the following problems you will be given two groups of figures,
A and B, followed by two numbered figures. You must decide what characteristic it is
that distinguishes the A figures from the B figures.

[Example item: figure groups A and B, followed by numbered figures 1 and 2]

In the example above, you will note that all the figures under A are white; those
under B are striped. The problem is to classify the figures under 1 and 2 as belonging
to group A or to group B. The figure numbered 1 is white and therefore is an A; the
figure numbered 2 is striped and therefore is a B. On the answer blank a cross has
been made in the box beneath A-B, showing that in the last column there is an A
figure followed by a B figure.

For each question determine in which group (A or B) each of the numbered figures
belongs. On your answer sheet, record your answer under the pair of letters
which shows the order in which the numbered figures appear (A-A if both figures
are A; A-B if the first figure is A and the second B; B-A if the first is B and the
second A; B-B if both are B).

[Sample items and answer blank]

FIG. 13.3. Sample Items from the Law School Admissions Test. Educational
Testing Service. (By permission.)


Evaluation of Legal Aptitude Tests. Prediction of scholastic
achievement in law schools has been fairly widely studied.¹⁵ The two most
valuable bases for prediction are marks in prelaw courses and
performance on legal aptitude tests. Investigators are not in agreement
as to which of these is the more significant, some having found that the
former yields higher coefficients, while others have found the latter
does so. In regard to either one of these predictive factors, the
coefficients of correlation with first-year marks in law school were largely
between .40 and .60. However, when prelaw college marks were combined
with results of law aptitude tests, the multiple correlations with
law-school marks were raised to the high .60’s, and in some instances
to the high .70’s.

15 See H. R. Douglass, University of Minnesota Studies in Predicting
Scholastic Achievement, Part II, Minneapolis: University of Minnesota Press,
1942; M. Adams, “Prediction of Scholastic Success in Colleges of Law,”
Educational and Psychological Measurement, Vol. 3, 1943, pp. 291-305, and
Vol. 4, 1944, pp. 13-19. Also: A. B. Crawford and T. J. Gorham, “The Yale
Legal Aptitude Test,” Yale Law Journal, Vol. 49, 1940, pp. 1237-1249; W. B.
Schrader and M. Olsen, “The Law School Admissions Test as a Predictor of
Law School Grades,” Princeton, N. J.: Educational Testing Service, 1950, p. 10.


sumably, having been denied admission 
range of performance both on the 


criterion (law school grades). 
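
The effect of a restricted range of performance on a validity coefficient can be illustrated with the standard formula for direct selection on the predictor. If the predictor's standard deviation among admitted students is a fraction u of its standard deviation among all applicants, a correlation R in the full group is expected to shrink to r = R·u / √(1 − R² + R²·u²). The figures in the sketch are hypothetical and are not taken from the studies cited; the formula also assumes linear regression and equal error variance over the whole range, which actual admissions data may not satisfy.

```python
import math


def restricted_r(unrestricted_r: float, u: float) -> float:
    """Correlation expected after direct selection on the predictor.

    unrestricted_r: correlation in the full applicant group.
    u: ratio of restricted to unrestricted predictor standard
       deviation (0 < u <= 1 when selection narrows the range).
    """
    if u <= 0:
        raise ValueError("u must be positive")
    r2 = unrestricted_r**2
    return unrestricted_r * u / math.sqrt(1.0 - r2 + r2 * u**2)


# HYPOTHETICAL values: a validity of 0.60 among all applicants,
# with admission narrowing the predictor's spread by varying amounts.
for u in (1.0, 0.8, 0.6, 0.4):
    print(u, round(restricted_r(0.60, u), 2))
```

Read this way, a coefficient in the .40 to .60 band computed on admitted students alone may understate how well the same predictor would separate the full body of applicants.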


A second factor is the disparities between ac 


is, validity should be establishe 
regard to both prelaw grades and aptitude test scores, 


Another important aspect of this problem is the fact that original
selection of test content was a matter of face-validity. The materials
for the tests were included because they seem appropriate to teachers
and practitioners of law, since the items are believed to require mental
activities analogous to those used in the study and practice of the
profession. This hypothesis, seemingly reasonable (and not unlike that
adopted for other types of tests), should be subjected to investigation.
Some of the questions to be answered are these: What is the predictive
value (validity) of each of the several parts? What is the reliability
of each part? How significant is the speed factor? What is the
significance, in the tests, of technical vocabulary and legal information? Are
the test batteries (requiring 3 to 4 hours) too long, so that fatigue and
satiation influence the scores? To what extent do coaching and
cramming affect test performance?

In spite of existing inadequacies, the available tests for selection of
students of law are sufficiently far along in development and are
helpful enough to warrant their use in conjunction with prelaw college
grades. This conclusion is warranted by the significant validity
coefficients found in some universities, by the high coefficients of multiple
correlation, and by the ability of the tests to identify a very large
portion of the candidates who show high promise and of those who show
low promise.


TESTS OF APTITUDE FOR TEACHING 

General Characteristics. It is extremely difficult to devise tests 
in this field because professional preparation of teachers is not so nar- 
rowly defined as is the case, for example, in law or medicine; and 
because teaching encompasses such a wide range of subject matter 
and educational levels. Probably the most useful single psychological
test to predict quality and level of performance in the preparatory
course for teaching is a test of general intelligence (or scholastic
aptitude), since preparation for teaching is essentially academic. Equally
important are the personality traits, essential in successful teaching,
that may be discerned by means of interest inventories and
personality tests, including projective techniques. As yet, however, no
definitive psychological studies have been made of those traits.

The Coxe-Orleans Prognosis Test of Teaching Ability (1930) is
intended to measure capacity to learn the subject matter taught in
normal schools and in other institutions that educate teachers.¹⁶ The test
consists of five parts: general information; student’s observation of
classroom practices; brief lessons followed by questions in educational
psychology, educational measurement, principles of teaching;
comprehension of professional reading materials; problem situations, followed
by questions to test the student’s evaluation and response to each one.

16 World Book Co., Yonkers, N. Y.

The National Teachers Examinations¹⁷ (annual editions since
1940) are designed for use in the selection of teachers and for the
appraisal of students in training. They are essentially subject-matter
tests (including problems and applications) for the evaluation of
relative adequacy of preparation of candidates for teaching positions in
elementary and secondary schools. The battery of tests has two major
divisions: the Common Examinations and the Optional Examinations.

Common examinations include the following tests, intended to
measure professional knowledge, breadth of perspective on
contemporary life, and reasoning ability.

Professional information: general principles and methods of teaching;
educational psychology; child development; guidance and personnel
services; evaluation of materials; philosophical and historical
developments relating to contemporary education.

General culture: history, literature, and fine arts; science and
mathematics; English expression.

Nonverbal reasoning: “abstract figures” are used, similar to those
found in the matrices tests (discussed in an earlier chapter). The
figures are incomplete; the examinee is required to discover the principle
of arrangement.

Optional examinations, in addition to the foregoing, are offered in
specialized areas, as follows: education in the elementary school
(grades 1 through 8), early childhood education (through grade 3),
biological sciences, English language and literature, industrial arts,
mathematics, physical sciences, social studies, and physical education.


: ri; fter pro- 
fessional education to measure the professi 
standing the candidate has acquired, 


17 Educational Testing Service, Princeton, N. J.




The Coxe-Orleans test, for which data are available, does what it 
sets out to do about as well as aptitude tests in other fields; for its 
authors have found correlation coefficients with scholastic grades in 
normal schools to vary from about .50 to about .85. The National 
examinations are also effective tests as measures of knowledge and 
understanding acquired and developed in the course of professional 
education. 

Authors of these tests are careful to emphasize, however, that 
knowledge of subject matter and understanding of teaching proce- 
dures and of human behavior do not necessarily indicate one’s ability 
to apply these effectively in the actual teaching process. Other aspects 
of the candidate’s personality need evaluating if future classroom ef- 
fectiveness is to be estimated: motives for teaching, emotional sta- 
bility, social values, ability to communicate and establish contact with 
others (rapport), attitude toward and concept of the self. 

It is quite improbable that any single personality scale will be able
to yield useful results for all these traits. Some efforts, however, are
being made to construct tests that will provide information regarding
the non-intellective personality aspects essential in teaching. One of
these is The Case of Mickey Murphy: A Case-Study Instrument for
Evaluating Teachers’ Understanding of Child Growth and
Development.¹⁸ Another is the Minnesota Teacher Attitude Inventory,¹⁹
designed to predict how well the prospective teacher will get along with
pupils. Both are pencil-and-paper tests that will have to be validated
against competently rated classroom performance of many teachers
who took the tests before beginning their professional careers.²⁰

TESTS OF SCIENCE AND ENGINEERING APTITUDES 

General Characteristics. Scientific aptitude is not a special
talent in the sense that musical aptitude, for example, is thought to
be. Scientific aptitude may be characterized best as the application of
general intellectual capacity to scientific materials. A test of scientific
aptitude, therefore, should be regarded as a device intended to indicate
the probability of success in scientific studies and occupations, without
implying that such a test measures psychological functions which are
essentially different from those used in other types of mental activity.
An early illustration of this type is the Stanford Scientific Aptitude
Test (1930) which is intended for high-school seniors and college [...]
Its author states that the subtests evaluate experimental bent; clarity
of definition; suspended [...]; inconsistencies; fallacies; [...];
caution and thoroughness; discrimination of values in [...] arranging
experimental data; accuracy of interpretation; [...] of observation.
Though this test has received [...] attention and more than occasional
use [...] searching process of standardization [...]

18 By W. R. Baller. University of Nebraska Press, 1948.
19 By W. W. Cook, C. H. Leeds, and R. Callis. Psychological Corporation.
20 It will have occurred to the reader that the tests of aptitude for teaching
should, strictly, have been included and discussed under “proficiency tests” in
the next chapter. They are included here, however, because they deal with one
of the higher professions.
21 By D. L. Zyve, Stanford University Press, Stanford, California.
22 By B. V. Moore, C. J. Lapp, and C. H. Griffin, Psychological Corporation,
New York.

[FIG. 13.4. Prediction of Scholastic Success in a Group of Five Engineering
Colleges by College Board Test Scores Combined with High-School Grades.
Educational Testing Service. (By permission.)* The chart gives the chances
in 100 that a student will earn average or better first-term engineering
college grades, by portion of class ranked on test scores combined with
high-school grades or on grades alone: highest 4%, next 12%, 34% just above
the median, 34% just below the median, next 12%, lowest 4%. Bars are shown
at theoretical lengths; actual lengths at individual schools might differ
slightly. An accompanying chart of College Board tests gives reported
average correlation coefficients against engineering college grades for
these predictors: Scholastic Aptitude Test, Verbal Section (SAT-V);
Scholastic Aptitude Test, Mathematical Section (SAT-M); Advanced
Mathematics Test; Composite Score (an average of scores on the SAT-M and the
Advanced Mathematics Test); high school grades; SAT-M combined with high
school grades.]

* Data, courtesy of Dr. W. B. Schrader, for 721 enrolled engineering
freshmen tested during their first week in the Fall of 1948 at Carnegie
Institute of Technology, Cornell University, Lehigh University, Rutgers
University, and the University of Pennsylvania.

[...] subtests are: mathematics (algebra), formulation of scientific
relationships in al[...], [...] science information, arithmetical
reasoning, scientific vocabulary, and comprehension of mechanical
relationships and problems. [...] authors present statistical details
[...] reasonably high degree of validity [...] criteria. The highest
[...] coefficients were found with marks in physics and chemistry [...]
“manufacturing processes” [...] instrument is more of a test [...]
purely scientific subjects than for applied engineering subjects [...]
Correlations with grade averages [...] eight semesters ranged from .58
(first semester) to .26 [...] in coefficients being [...] from a
coefficient of moderate


magnitude to one that is very low, the usual influencing factors may be 
operative: increasingly homogeneous groups of students, reduction 
of individual differences due to earlier handicaps or advantages, in- 


value than the correlations suggest, 
The Pre-Engineering Ability Test (1952) has only two parts: com- 
prehension of scientific materials and general m 


Evaluation of Science and Engineering Aptitude Tests. 
Purpose of the instruments in this 


other, and if, in addition, the reasons are sought for significant
discrepancies that might be found between the two scores of any
candidate




complex problems of intellective and nonintellective behavior are en- 
countered in all areas when psychologists deal with learning, school- 
ing, motivation, standards, etc. 

Results obtained with current tests of engineering and scientific
aptitude can be most helpful when analyzed and interpreted by a
qualified professional counselor or psychologist. It is not the total
score alone that might be useful. A profile representing scores on
subtests can be even more valuable; but it is necessary to go still further
in the analysis of an individual's performance. That is, a detailed
analysis of performance on each type of subtest and item may reveal an
individual’s strengths and weaknesses that can then be evaluated in an
interview with the subject, in the light of other available educational
and psychological information regarding him. While numerical ratings
obtained on some engineering and scientific aptitude tests will be
of significance, considering their present validity coefficients, greater
value from them will be derived in counseling by a qualified person
who interprets the results in the light of his educational, psychological,
and occupational insights. This is, admittedly, a subjective approach
in part; but the development of this type of aptitude tests has not yet
progressed far enough to warrant sole dependence upon their objective
aspects.


INTEREST INVENTORIES 

An individual’s aptitudes and abilities ordinarily are not so
highly specific that he can be given guidance solely on the basis of
aptitude tests. Motivation, determined by one’s interests, values, and
preferences, may be the deciding factor in the selection of a course of
study or of an occupation. Often, however, persons make their
selections through chance influences rather than because of self-evaluation
and knowledge of the field. Several devices, therefore, have been
prepared to assist in the process of self-evaluation; and the results
obtained with these are to be used in conjunction with data obtained
from other sources. We shall briefly describe two widely used
instruments, one of which scores an individual’s interests in and preferences
for a limited number of areas, whereas the other scores the individual’s
responses with reference to specific occupations.


The Kuder Inventories. The first of these is the Kuder Preference 
Record (1934-1951), in several forms, to be used with persons from 


362 Aptitude Tests: Fine Arts and Professions 


grade 9 and onward. The inventory includes 168 items, each of
which lists three activities; for example:


Build bird houses 
Write articles about birds 
Draw sketches of birds 


likes most and which least. The items, 
covering a wide range of activiti 


a view to determin- 
strong. 
Identification of are; 


listed under each area of prefe 
which are associated with jt 
sideration. Since it is not 
more than one area of strong Preference, Kude 
Cupations of possible interest unde ious paj 
as perak r mechanica. 4 i 3 Preferences such 
tific-social service, Persuasive-lite 
based upon actual data; many, h 
judgment of consistency bet 
cluded in the inventory, 


artistic, scien- 
he listings are 
on the author’s 
and activities in- 


The Strong Inventories. 
(1927-1951) available in separate fi 


el 

sonal Preference Record (1948-1953) This inventory Consists eae a per- 
describing different types of personal and social activities, T| scales Scales 
garded as being relatively independent of th 


employee placement. 
A Stanford University Press, 1938, 


TABLE 43

Percentile Ranks of Mean Scores in Various Occupations²⁵
(Kuder Preference Record)

Occupation                          Outdoor  Mech.  Com.  Sci.  Pers.
Men
Accountants                           42      27     94    43     3
Civil Engineers                       65      57     71    66    18
Lawyers and Judges                    41      19     48    36    57
District Forest Rangers               8+      50     56    54    32
Office Managers and Chief Clerks      37      29     77    39    57
Sales Managers                        32      30     32    38     /
Women
Librarians                            41      46     30    22    44
ans                                   84      TI     47    86    22
Social and Welfare Workers            51      B      31    43    34
Occupational Therapists               70      8S     32    58    26
Trained Nurses                        51      57     43    69    33
Secretaries                           51      43     49    41    64
Sales Clerks

Occupation                          Art.    Lit.   Mus.  Soc.  Cler.
Men
Accountants                           34      63     56    36    83
Civil Engineers                       67      35     42    25    55
Lawyers and Judges                    43      82     61    48    n:
District Forest Rangers               22      54     36    33    29
Office Managers and Chief Clerks      45      65     39    43    66
Sales Managers                        +l      d      59    47    43
Women                                 66      $3     52    29    38
y A                                   60      58     45    63    11
Social and Welfare Workers            59      ;      A     81    25
Occupational Therapists               81      4      6     58    14
Trained Nurses                        58      46     54    5     26
Secretaries                           50      64     61    39    58
Sales Clerks


(or women) one would like most and least to have been, positions 
one would like most and least to hold in an organization, comparison 
of interests between paired items, and self-rating of present abilities 
and traits.

The purpose of the inventory is to find the extent to which an indi- 
vidual’s interests and preferences agree with those of successful per- 
sons in specified occupations: forty-one for men, twenty-five for 
women. The inventory, scored separately for each occupation, yields 


25 From Examiner Manual, Science Research Associates, Chicago, 1951,
pp. 14 ff. (By permission.)




[Facsimile of sample items from the Vocational Interest Blank for Men
(Revised), Form M, by Edward K. Strong, Jr., published by Stanford
University Press. The parts shown include Part I, Occupations; Part II,
Amusements; Part IV, Activities; Part V, Peculiarities of People; a
comparison of interest between two items; and a self-rating of present
abilities and characteristics.]

TABLE 44. Sample items from Strong Vocational Interest Blank.
Stanford University Press. (By permission.)




ratings which are considered to be indicative of the subjects’ interest
in each. It is possible, also, to score the inventory for six “occupational
groups,” rather than for specific occupations, in instances where
one wishes to know which broad fields of occupation are indicated.
The inventory may, in addition, be scored for “non-occupational” interests,
found to be useful in guidance: namely, interest maturity;
masculinity-femininity (based upon percentile scores of males and
females on interest items); occupational level (differences in interests
between “laboring men,” on the one hand, and “business and professional
men earning $2500 and upwards a year,” on the other); studiousness
(“factors which contribute to scholastic achievement that are
not measured by intelligence tests”).


[For each of the first ten items on the Vocational Interest Blank
(Actor (not movie) through Author of technical book), the table gives
the percentage of “men-in-general” and the percentage of engineers
responding like, indifferent, and dislike; the differences in percentage
between the two groups; and the resulting scoring weights for
engineering interest.]

TABLE 45. Determination of weights for an Occupational Interest
Scale: Engineering *

The procedure followed in determining the significance of and the
weights to be given particular responses is, in outline, as follows:²⁶

1. Blanks are filled out by an adequate sampling of an occupation
and by a much larger sampling of “men-in-general.”
2. Responses to each item are tallied according to the three categories:
like, indifferent, dislike.
3. Frequencies of responses to each item, in each of the three categories,
are calculated in terms of percentages. Table 45 presents the
percentages for the first ten items on the blank, as responded to by
“men-in-general” and by engineers.
4. Weights are calculated for each of the frequencies of each of the
three responses (like, indifferent, dislike) to each item. Table 45
shows that 21 percent of “men-in-general” but only 9 percent of
engineers would like to be actors. For this item, a calculated weight of
1 is obtained; and it is given a minus sign because fewer engineers
marked “like” for this item. Essentially, the weight of an item depends
upon the difference in percentages found between an unselected
sampling of persons and that of the occupation for which the item
weight is being determined. Table 46 illustrates scoring and weights
for the ten items in Table 45.

26 Cf. E. K. Strong, Vocational Interests of Men and Women, Stanford
University Press, 1943, p. 74. The formula for determining weights, p. 611.

* From E. K. Strong, Vocational Interests of Men and Women, Stanford
University Press, 1943, p. 75. (By permission.)

TABLE 46. Scores obtained by an engineer for engineering interest;
also scores for interest in five other occupations²⁸

[For each of the first ten items on the Vocational Interest Blank, the
table lists the scoring weights for engineering interest, the responses
of one engineer and the scores they earn, and the corresponding scores
on five other occupational interest scales, followed by the totals for
the ten items and for all 400 items, and the resulting ratings.]
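The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tallies are invented so that 21 percent of men-in-general mark "like," as the text states, and the rule that turns a percentage difference into a weight (divide by a hypothetical `points_per_unit` and round) is a stand-in, not Strong's published formula.

```python
from typing import Dict, List

RESPONSES = ("like", "indifferent", "dislike")


def percentages(tally: Dict[str, int]) -> Dict[str, float]:
    """Step 3: convert the raw tallies for one item into percentages."""
    total = sum(tally[r] for r in RESPONSES)
    return {r: 100.0 * tally[r] / total for r in RESPONSES}


def item_weights(occupation_tally: Dict[str, int],
                 general_tally: Dict[str, int],
                 points_per_unit: float = 12.0) -> Dict[str, int]:
    """Step 4: a signed weight for each response to one item.

    ASSUMPTION: dividing the percentage difference by `points_per_unit`
    and rounding is a hypothetical simplification of the real formula.
    """
    occ = percentages(occupation_tally)
    gen = percentages(general_tally)
    return {r: round((occ[r] - gen[r]) / points_per_unit) for r in RESPONSES}


def score_blank(answers: List[str], weights: List[Dict[str, int]]) -> int:
    """Sum the weights attached to the responses a subject actually gave."""
    return sum(w[a] for a, w in zip(answers, weights))


# Illustrative tallies for the item "Actor (not movie)".
general_tally = {"like": 210, "indifferent": 320, "dislike": 470}
engineer_tally = {"like": 9, "indifferent": 31, "dislike": 60}

actor_weights = item_weights(engineer_tally, general_tally)
print(actor_weights)
print(score_blank(["dislike"], [actor_weights]))
```

A response that engineers give less often than men-in-general receives a negative weight, and one they give more often receives a positive weight; a subject's score on the scale is the sum of the weights of his responses.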


28 From E. K. Strong, op. cit., p. 75. (By permission.)

Evaluation of Interest Inventories. The Kuder and the Strong inventories
are not tests of aptitude; they indicate only the extent of similarity
between interests and preferences expressed by those being
examined and those of persons successfully engaged in the specified
occupations or areas. It has been found that persons successfully engaged
in a particular occupation or group of occupations have, in general,
some communality of preferences and dislikes which differentiate
them from persons in other occupations and groups. The principle
underlying these and similar inventories is, then, that an individual
who has a pattern of preferences and dislikes similar to the distinguishing
aspects of a given group or occupation has a greater chance of
finding that type of activity congenial and, hence, of succeeding in it,
provided, of course, that he has also the degree of aptitude required.
The differentiating value of the Strong inventory is illustrated in Figure
13.5, which “contrasts rather markedly divergent occupational
groups in respect to interests.”

These two inventories differ in their approach to the problem. The
Kuder attempts to identify important aspects of vocational interests in
broad occupational areas. A probable occupation or a restricted list
of occupations may be suggested by a score in a single area. Or still
another restricted list may be suggested by a combination of two or
more areas. The Strong, on the other hand, aims primarily to provide
patterns of preferences that distinguish one specific occupation from
others, as rated by the same inventory.

Reliability and Validity. While reliabilities, in terms of internal
consistency, of both the Kuder and Strong inventories are satisfactory
(the coefficients falling generally between .85 and .95), the more
difficult problem, by far, is demonstration of their validity. For the
Strong inventories, a number of different criteria were used: namely,
mean scores and standard deviations of criterion (specific occupational)
groups compared with a general sample, relationships of scores
to other test scores (of a variety of types), correlations with grades in
schools and colleges, completion of occupational training, ratings of
success in work, earnings (in sales work), persistence in occupations,
job satisfaction, and differences between occupational groups. Excepting
the correlations found with tests of intelligence, educational
achievement, personality traits, and with school and college grades,
these criteria, all already more or less familiar to the student, have
been found to be related significantly to the scores on the Strong
inventories.

[Figure: distributions of standard scores for several occupational
groups, including Certified Public Accountants; horizontal axis,
Standard Score.]
FIG. 13.5. Distribution of standard scores of five occupations on
Artist Scale. From E. K. Strong, op. cit., p. 111. (By permission.)

The Kuder inventories have used much the same types of validation
criteria. In addition, scores in the separate areas have been intercorrelated.
These reported coefficients have ranged from −.34 (scientific
with persuasive) to +.50 (clerical with computational). Ideally, of
course, low intercorrelations are desired; for then the scores will have
greater differentiating value.
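As a rough illustration of what intercorrelating the area scores involves, the sketch below computes a product-moment correlation for every pair of areas. The scores are invented for the example and the function names are hypothetical; nothing here reproduces Kuder's data.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson_r(x: List[float], y: List[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercorrelations(scores: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlate every pair of interest areas once."""
    areas = list(scores)
    return {(a, b): pearson_r(scores[a], scores[b])
            for i, a in enumerate(areas) for b in areas[i + 1:]}


# Invented area scores for six examinees (illustration only).
area_scores = {
    "scientific": [52, 61, 45, 70, 38, 57],
    "persuasive": [48, 40, 59, 35, 62, 44],
    "clerical": [30, 42, 51, 38, 47, 55],
}

table = intercorrelations(area_scores)
for (a, b), r in table.items():
    print(f"{a} with {b}: {r:+.2f}")
```

Coefficients near zero mean that two areas carry largely separate information, which is the condition the text describes as desirable.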

Both inventories were validated, also, by means of item analysis. 
Essentially, the problem would be to examine an adequate number of 
responses to each item by known groups, to determine which of them 
are selected and rejected most often by persons strongly interested or 
successful in a given area (Kuder) or occupation (Strong). For ex- 
ample: do high scorers in the “Scientific” area select an item more 
often than the low scorers in the same area? Do high scorers in a given 
area tend to choose the same items? How frequently, say, do success- 
ful engineers select a particular item? 
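The first of these questions can be illustrated with a simple high-group versus low-group comparison. This is a sketch only: the data are invented, and comparing the top and bottom 27 percent of scorers is a common convention assumed here, not a procedure attributed to Kuder or Strong.

```python
from typing import List


def item_discrimination(area_scores: List[float],
                        chose_item: List[bool],
                        fraction: float = 0.27) -> float:
    """Proportion of high scorers choosing the item minus that of low scorers.

    ASSUMPTION: `fraction` (27 percent) defines the extreme groups; it is a
    conventional choice, not one stated in the text.
    """
    n = len(area_scores)
    k = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda i: area_scores[i])
    low, high = order[:k], order[-k:]

    def rate(group: List[int]) -> float:
        return sum(1 for i in group if chose_item[i]) / len(group)

    return rate(high) - rate(low)


# Invented "Scientific" area scores and choices on one item.
scientific = [90, 85, 80, 75, 70, 40, 35, 30, 25, 20]
chose = [True, True, True, False, True, False, False, True, False, False]

d = item_discrimination(scientific, chose)
print(f"discrimination index: {d:.2f}")
```

An index well above zero indicates that the item is selected more often by high scorers in the area than by low scorers, which is what item analysis looks for.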

In this connection, Kuder has found that “mean profiles” for occupational
groups “. . . indicate in general that the names assigned
to the various scales are appropriate in terms of the type of occupation
entered as well as in terms of the activities for which the scale is
scored. Chemists are found to be particularly high on the scientific
scale, writers on the literary scale . . . ,” etc. He reports, also, that
scores on the preference record are consistent with students’ choices
of occupations and of college curricula. Strong likewise finds that his
inventory differentiates persons successfully engaged in one occupation
from those successfully engaged in others. A second criterion employed
by Strong is the extent to which the inventory distinguishes
between persons who are successful in an occupation and those not
so successful in the same work. In this respect, too, Strong reports
satisfactory results.

The scores obtained with the Kuder preference record have in general
a low correlation with educational achievement tests, with course
marks, and with tests of general intelligence. The Strong inventory,
when correlated with tests of intelligence, yielded coefficients which
were low or negligible, some of them even being negative. These results
indicate that there is but a very weak relationship, or none at all,
between general ability and occupational interests. When, therefore,
these and similar devices are used in guidance, they must be supplemented
by measures of general ability, among others, especially in
attempting to determine occupational level (for example, as between
clerk and accountant, or machinist and mechanical engineer, technician
and scientist).

Stability of Interests. . . . years of these and similar . . . today, is
the extent to which . . . inseparable factors of age . . . hence, the
question of the dependability and . . . studies comparing . . . , from
age 15 to 55. It . . . that closer correspondence between 15-year-olds
and men above age 25. This interesting phenomenon may be due to
shifting interests, searching, trial-and-error, and unrealistic concepts
of the self, . . . investigations, however, the same persons were
examined twice: one group as college freshmen and again after a
period of nine years; another group as college . . . ten years.
Correlations . . . is a . . . toward stabilization after college graduation.


Data dealing with stability of interests and preferences of a group
of high-school pupils of both sexes (in grades 10 and 11), who were
re-examined as college seniors, yielded a correlation coefficient of .52,
again only a moderate degree of correspondence.
Studies of the Kuder, dealing with the same question, indicate that
over periods of about a year, the scores on this inventory do not
change greatly, in the case of an adult group. This is a finding to be
expected. Other longitudinal studies indicate that only moderate
changes in score occur during the high-school and college years. This

General Evaluation of Aptitude Tests 374 


is not entirely in accord with the data on the Strong; but the difference 
is probably due to the fact that the latter measures interests for specific 
occupations, whereas the Kuder does so for broad areas. 

It is not surprising that changes in scores are found over a period 
of time. Many considerations and factors other than ability and pref- 
erences influence vocational choices which, in turn, have their influ- 
ence upon subsequent interests. These influencing factors are often 
subtle, unpredictable, and frequently unknown. Under these circum- 
stances, the results and information provided by these inventories are 
very creditable. 

Applicability to Different Age Groups. With whom should such
occupational and interest inventories be used? Obviously, they can
have validity only for persons whose lives have been long enough and
varied enough to have provided them with experiences of the kind
which will enable them to choose between the alternatives presented
by each item in the inventories. The Kuder Preference Record, concerned
with broad areas of interest, is standardized for high-school
students (beginning with grade 9), college students, and adults at
large. The Strong Vocational Interest Blank, being concerned primarily
with specific occupations, is intended for ages seventeen and over.
Since the Strong inventory has been based upon responses of adult
men and women, more valid and useful results will be obtained with
adults than with persons in their teens. Since the Kuder record has
been standardized with high-school and college students, as well as
with adults, it may be used appropriately with adolescents. But even
so, the interests, values, and attitudes of adolescents are still in a state
of flux and are as yet not fully developed; hence the results of the
preference record, when used for guidance purposes, must be interpreted
with this factor in mind.

With either a high-school or a college student, the scores and the 
profile obtained with the sounder inventories in this category are use- 
ful as an introduction to the study of occupations which involve activi- 
ties of the sort for which he has indicated a preference, and during 
interview and counseling to check the individual's choice of an occu- 
pation against his expressed interests and preferences. For purposes of 
guidance at the secondary-school and college levels, it appears that the 
Kuder has the greater value because it is less specific. 

Both instruments under discussion are intended to provide measures
of motivation in various fields of study and work. Their application . . .

. . . colleges and professional schools also can employ the same
principle if they are satisfied with a mechanical . . . impersonal
procedure. For this purpose, tests of even relatively low . . . or .40)
have some value when . . . has been used.³¹ The term . . . persons
selected to the number . . . 00 individuals were examined for a
specified type of . . . of law, for example, and if . . . , the selection
ratio would be . . . , the degree of superiority depending upon the size
. . . coefficient. For example, assume we have an . . . correlates .50
with the criterion (e.g., level of performance . . . professional school).
Assume also that . . . of the candidates for selection . . . probable
average . . . success.³²

Otherwise stated, . . . selected group are . . . of those retained are
likely to prove satisfactory if the validity coefficient . . .

For example, if the coefficient is .20, the percent satisfactory will
probably be 56; if the coefficient is .30, the percent is 60; if the
coefficient is .40, the percent is 63; if the coefficient is .70, the percent is
75. Tables of relationships have been calculated for combinations of
selection ratios ranging from .05 to .95 and coefficients of validity
from .00 to 1.00. These tables provide a useful general index for estimating
the probable value of a test of given validity in a given situation.

31 See H. C. Taylor and J. T. Russell, “The Relationship of Validity
Coefficients to the Practical Effectiveness of Tests in Selection: Discussion
and Tables,” Journal of Applied Psychology, Vol. 23, 1939, pp. 565-578.

32 For details of this method see Taylor and Russell, op. cit. Also, M. Smith,
“Cautions Concerning the Use of the Taylor-Russell Tables in Employee
Selection,” Journal of Applied Psychology, Vol. 32, 1948, pp. 595-600.

We must re-emphasize that this technique yields indexes applicable 
to a group, but it does not show what will necessarily happen in the 
case of a particular individual within that group. The utilization of the 
selection ratio does, however, add to the utility of aptitude tests when 
group trends and a minimum of elimination from a job or course of 
study are the first concerns. When the individual himself is of first 
concern—as he is to clinician and counselor—aptitude-test results 
must be used only as one kind of information in a total picture. 
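The percentages quoted earlier for validity coefficients of .20, .30, .40, and .70 can be reproduced with a short calculation. The sketch assumes that test and criterion follow a bivariate normal distribution, that the upper half of the candidates on the test is selected, and that half of all candidates would prove satisfactory without selection. These conditions are assumptions made for the illustration; under them the proportion satisfactory among those selected has a closed form, 0.5 + arcsin(r)/π.

```python
from math import asin, pi


def percent_satisfactory(validity: float) -> float:
    """Percent of selected candidates expected to prove satisfactory.

    ASSUMPTIONS: bivariate normal test and criterion, selection ratio .50,
    and 50 percent satisfactory among unselected candidates.
    """
    if not -1.0 <= validity <= 1.0:
        raise ValueError("a validity coefficient must lie between -1 and 1")
    return 100.0 * (0.5 + asin(validity) / pi)


for r in (0.20, 0.30, 0.40, 0.70):
    print(f"validity {r:.2f}: {round(percent_satisfactory(r))} percent satisfactory")
```

With a validity of zero the function returns 50, the percentage expected without any test. Like the tables themselves, the result describes a group and says nothing certain about a particular individual.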


STEPS IN AN APTITUDE TESTING PROGRAM 


Although tests for only very few of the many occupations, including
professions, have been presented, the procedures and problems
are the same in all fields. While tests are available in certain specified
areas (e.g., mechanical, scientific, verbal, etc.) the particular combinations
and emphases for a given vocation will depend upon the
characteristics and demands of each occupation. And certain occupations
will need especially devised tests in addition. For example, the
studies of nursing, dentistry, and medicine have much in common, but
each has its own aspects. Thus, many occupational aptitude tests require
“custom made” batteries that consist of some parts common to
others and some parts specific to themselves.

When instituting a program of aptitude testing for the prediction of
performance and selection of individuals for any area, however, the
procedures are the same as for any other. The following steps are indicated,
if one is starting at the beginning:


(1) Analysis of the task to determine the traits that contribute to
success or failure.

(2) Development or choice of tests and other evidence for trial
purposes.

(3) Use of tests, etc., with an experimental group of subjects.

(4) Selection of validity criteria and obtaining validity data regarding
performance of the experimental group:

(a) Simple correlations of each test and each of the other ratings
with performance ratings or scores;

(b) Multiple correlations of the combined scores and ratings with
performance ratings or scores;

(c) Percents of successful and unsuccessful individuals getting high
scores and low scores on the tests;

(d) Critical scores (“cutting scores”) which show the test score or
rating level at which the probable percentage of failures in actual
performance begins to be so high that persons falling below it are
poor risks.

(5) Selection of tests and procedures based upon validity and other
standardization data.
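Parts (a), (c), and (d) of step 4 can be illustrated with a small experimental group. The scores, the success ratings, and the 75 percent failure threshold below are all invented for the example; in practice the threshold is a matter of judgment, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Optional


def pearson_r(x: List[float], y: List[float]) -> float:
    """Simple correlation of test scores with a performance measure (4a)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def percent_failing_below(scores: List[int], succeeded: List[bool], cut: int) -> float:
    """Percent of failures among those scoring below `cut` (4c)."""
    below = [ok for s, ok in zip(scores, succeeded) if s < cut]
    if not below:
        return 0.0
    return 100.0 * sum(1 for ok in below if not ok) / len(below)


def cutting_score(scores: List[int], succeeded: List[bool],
                  failure_threshold: float = 75.0) -> Optional[int]:
    """Highest score below which failures reach the threshold (4d).

    ASSUMPTION: the 75 percent threshold is a hypothetical value.
    """
    best = None
    for cut in sorted(set(scores)):
        has_lower = any(s < cut for s in scores)
        if has_lower and percent_failing_below(scores, succeeded, cut) >= failure_threshold:
            best = cut
    return best


# Invented data for ten subjects in an experimental group.
test_scores = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
success = [False, False, True, False, False, True, True, True, False, True]

r = pearson_r([float(s) for s in test_scores], [1.0 if ok else 0.0 for ok in success])
print(f"validity coefficient: {r:.2f}")
print("cutting score:", cutting_score(test_scores, success))
```

Correlating scores with a pass-or-fail criterion, as here, gives a point-biserial coefficient; with graded performance ratings the same function applies unchanged. Multiple correlation, part (b), would need a regression routine and is left out of the sketch.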


14. 




TESTS OF EDUCATIONAL 
ACHIEVEMENT 


SCOPE 

A test of educational achievement is one that is designed to 
measure knowledge, understandings, or skills in a specified subject, 
or group of subjects, taught in school. The test might be restricted 
to a single subject, such as arithmetic or history; or it might be a 
“battery,” that is, tests of several areas of subject matter yielding a 
separate score for each subject and a total score for the several sub- 
jects combined. 

Tests of educational achievement differ from those of intelligence 
in that: (1) the former are concerned with the quantity and quality 
of learning attained in a subject of study, or group of subjects, after 
a period of instruction; (2) the latter are general in scope and are in- 
tended for the measurement and analysis of psychological processes, 
although they must of necessity employ some content which has been 
acquired and which resembles some of the content found in achievement
tests.

Most tests of achievement are devoted very largely to the measure- 
ment of the amount of information recalled or skills and techniques 
acquired. In more recent years, however, an increasing number have 
been devised to measure such educational results as problem solving, 
drawing inferences from subject matter (inductive thinking), apply- 
ing generalizations to specific situations and problems (deductive 
thinking), attitudes and appreciations developed by the study of 
course materials, and practices and skills developed in the study of a 


. . . The psychologist . . . results in evaluating a pupil . . .
standardized achievement . . . are the subjective marks of individual
teachers, though the latter are not to be disregarded.

DERIVED INDEXES 


. . . may be converted into . . . “achievement age,” by means of a
table of age norms. Raw scores may be converted, also, into “grade
equivalents,” using a table of grade norms.

The reader is already familiar with the concept of the norm and
with the method whereby it is derived. Thus, if a pupil’s raw score
on a standardized test of the four fundamental arithmetical processes
gives him a grade equivalent of 4.5, it means that his achievement on
this test is equal to that of the average achievement of pupils who have
completed half of the fourth grade. By means of this index, obtained
for each of the several subjects of instruction, it is possible to get a
profile for each pupil and thereby to evaluate his general level of
achievement, the evenness of his performance, his . . . the type of
learning measured, . . .

. . . pupil’s . . . standing in the several subjects in which he has
been . . . that a pupil has been given a test battery on which his
scores were as follows: reading rate and comprehension, 9-year level;
arithmetic fundamentals, 10-year level; spelling, 9-year . . . 10-year
level. His EA would be the . . . the separate ages represents the . . .
achievement of the pupils of that age. The EA is a composite,
representing a pupil’s average achievement.

Achievement age (AA) designates a pupil’s level of performance in
a single school subject. The “subject age” is sometimes used
synonymously with AA. Thus, if a pupil gets a rating of achievement
. . . on an arithmetic test, his level is equal to the norm of . . .
pupils, as measured by the given test. If “subject age” is used, it would
be said that this pupil’s “arithmetic age” is 10. At present it is a
matter of personal choice whether to designate a pupil’s performance on
an arithmetic test as “achievement age (arithmetic)” or “arithmetic
age.”
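The conversion of a raw score into a grade equivalent, and the averaging of subject ages into an educational age, can be sketched as follows. The norms table is hypothetical, the straight-line interpolation between tabled points is an assumption, and the fourth subject ("language") is invented for the example; a published test would supply its own table and rules.

```python
from bisect import bisect_right
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical grade norms: (raw score, grade equivalent).
GRADE_NORMS: List[Tuple[int, float]] = [
    (10, 3.0), (18, 3.5), (25, 4.0), (31, 4.5), (36, 5.0),
]


def grade_equivalent(raw_score: float,
                     norms: List[Tuple[int, float]] = GRADE_NORMS) -> float:
    """Look up a raw score, interpolating between neighboring norm entries."""
    raws = [r for r, _ in norms]
    if raw_score <= raws[0]:
        return norms[0][1]
    if raw_score >= raws[-1]:
        return norms[-1][1]
    i = bisect_right(raws, raw_score)
    (r0, g0), (r1, g1) = norms[i - 1], norms[i]
    return g0 + (g1 - g0) * (raw_score - r0) / (r1 - r0)


def educational_age(subject_ages: Dict[str, float]) -> float:
    """Composite EA taken here as the mean of the separate subject ages."""
    return mean(list(subject_ages.values()))


print(grade_equivalent(31))   # a score that falls exactly on a tabled point
print(grade_equivalent(28))   # a score between two tabled points

ages = {"reading": 9.0, "arithmetic": 10.0, "spelling": 9.0, "language": 10.0}
print(educational_age(ages))
```

Scores beyond either end of the table are held at the nearest tabled value rather than extrapolated, a conservative choice made for the sketch.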

Grade norms, educational ages and achievement ages derived for 
different tests are not always comparable, nor are they always applica- 
ble in a given school or community, since the standardization popula- 
tions of these tests vary in respect to adequacy and representativeness. 
This is an important problem in achievement testing, since quality of 
education is far from uniform among various parts of this country. 
Rather than deriving only national norms, it is much more meaningful 
to present, in addition, separate norms for different sections of the 
country, and even for different types of communities (according to 
population). In connection with norms, it is necessary to recall the 
earlier distinction made between norms and standards (Chapter 2). 
Norms, in this instance, will represent the levels of manifested achieve- 
ment in the subject matter being tested. They do not necessarily repre- 
sent the level of learning that might be desirable or optimal—that is, 
a standard of performance. 

It is necessary to have a quotient to accompany an educational or 
an achievement age. Hence, there are two types of quotients used with 
tests of educational achievement when EA or AA is found: namely, 
the educational quotient (EQ) and the achievement quotient (AQ). 
The latter is sometimes called the accomplishment quotient. 

The educational quotient is the ratio of educational age to chrono- 
logical age (EA/CA) multiplied by 100 to remove the decimal; that 
is, the individual’s average level of measured learning in relation to 
what is expected on the basis of his life age. Theoretically, the “nor- 
mal” EQ is 100; deviations above or below represent, respectively, 
superior or inferior school learning as compared with the individual’s 
age group. 

The achievement quotient is the ratio of educational age to mental
age (EA/MA), multiplied by 100; that is, the individual’s average
level of measured learning in relation to what is expected on the basis
of his mental level.¹ Due to marked individual differences in mental
ability in any group of persons of the same life age, the MA is
regarded as a more reliable index of a person’s learning capacity than
is chronological age; therefore, the AQ is a more valuable index than
the EQ in judging whether or not a pupil’s school achievement is commensurate
with the quantity and quality of learning that might reasonably
be expected of him.

1 The AQ is also given as EQ/IQ. This is the equivalent to EA/MA, but in
quotient form. Thus: IQ = MA/CA; EQ = EA/CA. Substituting for EQ/IQ, we
have (EA/CA)/(MA/CA), or EA/MA.
The educational quotient and the achievement quotient have been 
used not only when the composite EA is derived and taken as the 
numerator of the ratio, but also when only the age level for a single 
subject (subject age or achievement age) is used as the numerator. 
When this is the case, then the quotients should be read as “educational
quotient in . . .” (naming the school subject tested; e.g.,
arithmetic); “achievement quotient in . . . .”
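The two quotients, and the equivalence of EA/MA with EQ/IQ, can be checked numerically. The ages below are illustrative values, not data from any test, and the function names are hypothetical.

```python
def educational_quotient(ea: float, ca: float) -> float:
    """EQ = 100 * EA / CA."""
    return 100.0 * ea / ca


def intelligence_quotient(ma: float, ca: float) -> float:
    """IQ = 100 * MA / CA."""
    return 100.0 * ma / ca


def achievement_quotient(ea: float, ma: float) -> float:
    """AQ = 100 * EA / MA."""
    return 100.0 * ea / ma


# Illustrative pupil: educational age 11, mental age 12, chronological age 10.
ea, ma, ca = 11.0, 12.0, 10.0

eq = educational_quotient(ea, ca)
iq = intelligence_quotient(ma, ca)
aq = achievement_quotient(ea, ma)

print(f"EQ = {eq:.0f}, IQ = {iq:.0f}, AQ = {aq:.1f}")
print(f"100 * EQ / IQ = {100.0 * eq / iq:.1f}")
```

This pupil is ahead of his age group (EQ above 100) yet below what his mental age would lead one to expect (AQ below 100), which is the distinction the AQ is meant to bring out.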
The value and wisdom of using the AQ have been seriously questioned,
for the three following reasons. (1) Since this index is derived from
two separate tests, each of which has a certain degree of error or
unreliability, it has a lower reliability than an index based upon
either test alone. (2) Since the norms of the two tests, one of
educational achievement and the other of general mental ability, have
in all probability been derived from different standardization groups,
they will not be strictly comparable, nor will their measures of
dispersion. For example, unless the distributions of scores of the two
standardization groups are approximately equal, a given EA (say, 10)
and EQ (say, 110) will not be comparable with an MA of 10 and an IQ of
110. Unless the two distributions are equivalent, or nearly so, the two
sets of apparently identical indexes will have different [...] meaning.
Hence, an index derived from the [...] (3) [...] are geared to pupils
of average ability, [...] an opportunity to learn up to a level
consistent with their capacities; hence, they will be penalized and
will get AQ’s of less than 100. Slow pupils, on the other hand, will be
working “over their heads,” so to speak, and will more often get AQ’s
above 100.2 In spite of these criticisms, however, and assuming that
one is aware of its limitations, there are occasions when the AQ serves
a useful purpose, especially in studying the educational problems of
superior children.
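
The first criticism can be made concrete with a rough numerical sketch.
The AQ is a ratio, and no formula for the reliability of a ratio is
given here; as an analogous case, the sketch below uses the classical
formula for the reliability of a difference between two equally
variable scores. The function name and all three coefficients are
assumptions chosen only for illustration.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Classical reliability of a difference X - Y, assuming equal variances.

    r_xx, r_yy: reliabilities of the two tests
    r_xy: correlation between the two tests
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical coefficients, not taken from any published manual.
r_achievement = 0.90   # reliability of the achievement test
r_intelligence = 0.90  # reliability of the mental ability test
r_between = 0.70       # correlation between the two tests

r_diff = difference_reliability(r_achievement, r_intelligence, r_between)
print(f"reliability of the derived index: {r_diff:.2f}")
```

Under these assumed values the derived index is noticeably less
reliable than either test taken alone, and the loss grows as the two
tests correlate more highly with each other.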


2 This criticism does not seem to have much merit; for what it says, in
effect, is that the AQ helps to reveal an undesirable educational
situation, so far as superior pupils are concerned.




TYPES OF ITEMS 


Test items may be classified into one or another of the following
major types: (1) simple recall; (2) two alternatives, such as
true-false, right-wrong, yes-no; (3) multiple-choice; (4) completion;
(5) matching; (6) analogies; and (7) check lists. The following
examples illustrate these types.


(1) Simple recall:
What are the dates of World War I? ................
What are the two main gases found in water? ................

(2) Two alternatives: 

NaCl is the chemical symbol for common salt. T F
Abraham Lincoln served two complete terms as president
of the United States. T F
(3) Multiple-choice: 
Scrooge is a character in 
1. Oliver Twist 
2. David Copperfield 
3. A Christmas Carol. 
An example of an appointed official is a 
1. congressman 
2. senator 
3. federal judge. 
(4) Completion:
The executive head of the United States government is the
................, while the federal legislative bodies are the
................ and the ................
(5) Matching: 
Directions: After each name write the number of the topic 
which is intimately associated with that person. 
(1) conditioned reflex tekone aa 
(2) age scale for testing intelligence A haan 
(3) reaction-time experiments Joh T Ramee 
(4) psychoanalysis Carell M $ $ 
(5) psychology of adolescence T E ; mae 
(6) existential psychology tae Othe hock 2 ; ; 
(7) factorial analysis 
(6) Analogies: 
Executive functions : President :: legislative functions : ....
Hydrogen : H :: sodium : ....

(7) Check lists:
Directions: Place a check mark in front of each of the following
items which is part of an automobile:

throttle        generator
rudder          distributor
gear shift      aileron
periscope       stabilizer


While these are the principal types of items used in tests of educa- 
tional achievement, they are not found with equal frequency; true- 
false, multiple-choice and completion are the most common. There 
are also variations on some types. For example, tests of paragraph 
meaning (e.g., in literature, social sciences, physical sciences, and 
others) are quite common. The testee reads a paragraph and is then 
required to answer questions intended to show the extent to which he 
has comprehended the paragraph. The questions usually are in the 
form of true-false, multiple-choice, or completion. In arithmetic and 
other aspects of mathematics, the items in a standardized test usually 
are in the same form as those devised by the individual teacher: that 
is, the student simply provides the correct answer. In tests of some
aspects of English, special types of items have been developed, as, for
example, in tests of punctuation and capitalization. In the following
illustration the necessary capitals and marks of punctuation are to be
supplied:


why did you come home so early mary asked 


THREE REPRESENTATIVE BATTERIES 


The number and range of tests of educational achievement are very
extensive; and they vary markedly in merit. Whenever a school or
clinical psychologist is faced with the necessity of evaluating
educational achievement of a group or of an individual, the selection
of tests must be based upon their appropriateness to the problem at
hand and upon their adequacy of standardization: range of grades
covered, aspects and comprehensiveness [...] validity [...] present a
comprehensive list [...] to familiarize the reader with the nature of
these instruments.


The Stanford Achievement Tests.3 These provide batteries at four
levels: primary, elementary, intermediate, and advanced. The primary
battery (for use at the end of grade 1, in grade 2, and in the first
half of grade 3) includes tests of paragraph meaning, word meaning,
spelling, arithmetic reasoning (concepts and problems), and arithmetic
computation (the four fundamental processes).

The elementary battery (for grades 3 and 4) includes tests of para- 
graph meaning, word meaning, spelling, language (mechanics and 
usage), arithmetic reasoning, and arithmetic computation. 

The intermediate battery (grades 5 and 6) and the advanced bat- 
tery (grades 7, 8, and 9) include the same tests: namely, paragraph 
meaning, word meaning, spelling, language, arithmetic reasoning, 
arithmetic computation, social studies, science, and study skills. 


The California Achievement Tests.4 These are designed for a very wide
range of educational levels: grades 1 through 14. The four batteries
(primary, elementary, intermediate, and advanced) all include the same
general tests, with changing content and of increasing difficulty, of
course. These tests are: reading vocabulary, reading comprehension,
arithmetic reasoning (mathematics reasoning in the advanced battery),
arithmetic fundamentals (mathematics fundamentals in the advanced
battery), mechanics of English and grammar, and spelling. Each of
these, with the exception of spelling, is divided into several subtests
that are regarded as the component parts of the broader category of
school learning.


The Iowa Tests of Educational Achievement. These are less compre- 
hensive in scope than either of the foregoing batteries, being devised 
for mid-8th grade to mid-13th. There are nine subtests: understanding 
of basic social concepts; general background in the natural sciences; 
correctness and appropriateness of expression; ability in quantitative
thinking; ability to interpret reading materials in the social studies
and natural sciences, and literary materials; general vocabulary; and
uses of sources of information.


Evaluation of the Stanford, California, and Iowa Batteries. These
three batteries, briefly described, are among the most widely used.


3 By T. L. Kelley, R. Madden, E. F. Gardner, L. M. Terman, and G. M.
Ruch. Published by World Book Co., Yonkers, N. Y., 1953.
4 By E. W. Tiegs and W. W. Clark. Published by California Test Bureau,
Los Angeles, 1933-1951.


They, and other similar batteries, have much in common in respect to
objectives, standardization numbers and range of population samples,
reliability, and conception of validation.

These batteries have been current long enough so that their normative
data, for each form, are based upon many thousands of cases: Stanford,
70,000 to 106,000; California, 50,000 to 100,000; Iowa, 50,000. In each
instance, the cases were distributed over a wide geographic area;
particularly the Stanford, the norms of which were derived from school
systems in 38 states.

Reliabilities of the component parts of all three are very largely
within the range considered satisfactory:

California: .83 to .96
Iowa: .81 to .94
Stanford: .73 to .95
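
One way to read these coefficients is to convert them into a standard
error of measurement, using the classical relation SEM = SD x
sqrt(1 - r). The sketch below does this for the ranges quoted above.
The standard deviation of 10 score units is an assumption made only for
illustration; the actual standard deviations of the subtests are not
given here, so the resulting figures show relative size only.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


ASSUMED_SD = 10.0  # hypothetical score SD, not taken from any manual

# Lowest and highest subtest reliabilities quoted in the text.
reported = {
    "California": (0.83, 0.96),
    "Iowa": (0.81, 0.94),
    "Stanford": (0.73, 0.95),
}

sem_ranges = {}
for battery, (low_r, high_r) in reported.items():
    # The highest reliability gives the smallest error, and vice versa.
    smallest = standard_error_of_measurement(ASSUMED_SD, high_r)
    largest = standard_error_of_measurement(ASSUMED_SD, low_r)
    sem_ranges[battery] = (smallest, largest)
    print(f"{battery}: SEM from {smallest:.1f} to {largest:.1f} score units")
```

On this assumption a subtest at the low end of the quoted range carries
roughly twice the error of one at the high end, which is why the lower
coefficients call for more caution in interpreting a single pupil's
score.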


Validity of these and similar tests is “curricular” validity (content
validity). This criterion of selecting test content is justifiable in
this area. The ultimate value o[...] defined; an analysis [...]es,
forms and [...] and school [...]t content are [...] all the skills,
types of [...] examined. The outlines [...] series of try-outs,
involving the usual standardization techniques, during which items are
revised or


5 See, as the most comprehensive single source, O. K. Buros (editor),
Third and Fourth Mental Measurement Yearbook, 1949 and 1953,
respectively.


eliminated, until satisfactory reliability and validity are achieved. To 
a considerable extent, these rigorous standardization procedures were 
used in constructing the three batteries under discussion, particularly 
so in the case of the Stanford. 

These batteries are intended to be analytical of pupil achievement 
in these ways: (1) they measure continuity of educational growth 
over a wide range of school grades; (2) they reveal group or class- 
room differences in subject-matter, skills, or insights being tested; (3) 
they reveal differences in competence within a single individual; (4) 
they identify those pupils who are so markedly deficient in any area as 
to require more intensive testing, observation by their teachers, and 
diagnosis of the elements of deficiency, with a view to remedial
instruction. The comprehensive survey batteries under discussion are
not designed for this kind of deficiency diagnosis, though they
contribute to this end in a limited degree. For example, if a pupil
earns a very low score on the arithmetic test in the battery (for
reasons other than poor general intelligence), a diagnostic test in
arithmetic (either standardized or especially devised) should reveal in
which of the four fundamental processes and at which levels he is
deficient; in which number combinations or specific or isolated skills
and understandings he is weak.

On the whole, achievement batteries of the quality of these three can
contribute much to the study and solution of the types of educational
and psychological problems presented under the section on “Purposes”
in this chapter.


READING TESTS 

Reading is the school subject to which most research has been 
devoted and for which the largest number of tests have been devised. 
This is quite understandable, since reading ability is the keystone of 
the usual school curricula through elementary and high schools, and 
also in most courses of study at higher levels. The numerous tests in 
this field have much in common, but they are not identical as to
content, some being more comprehensive than others. Reading tests may
be divided into three groups: achievement, diagnostic, and readiness.


Achievement. These tests are concerned with the measurement of 
how well the testee is able to read: that is, how rapidly and with what 
degree of understanding. For these two purposes, most current tests
measure rate of reading, sentence or paragraph comprehension, word
meaning (vocabulary), and recall of details. Less often, reading tests
measure knowledge of abbreviations, selection of key words, and use of
an index or directory. The tests in this category are intended to
measure achievement in each of the several areas and thereby to
indicate, at the same time, the causes of poor reading on the part of
individuals of whom a higher level of reading competency might be
expected.


Diagnostic. Such tests are intended, through analysis, to isolate in
detail the factors responsible for an individual’s reading
disabilities; that is, factors other than lack of general intelligence.
The following tests and the elements in reading, analyzed by each, are
representative of the approach to the problem and of the factors being
isolated.

Diagnostic Reading Tests.6 Grades 7-13.
1. vocabulary: in the fields of English, mathematics, science and
social studies
2. comprehension: silent and auditory
3. rate of reading: general, social studies, science
4. word attack: oral and silent; identification of sounds,
syllabication

Ingraham-Clark Diagnostic Reading Tests.7 Grades 1-8.
1. word form and meanings: word likenesses and differences, auditory
stimuli, visual stimuli, opposites, similarities, association.
2. sentences and paragraphs: following directions, answering
questions, directly stated facts, qualified statements, inferences and
conclusions, relevant and irrelevant statements, true and false
deductions, selecting and classifying information, form and mechanics
of organization and sequence of events.

Durrell Analysis of Reading Difficulty.8 Grades 1-6.
1. oral reading comprehension
2. oral reading recall
3. silent reading
4. word recognition
5. word pronunciation
6. spelling
7. handwriting

6 Committee on Diagnostic Reading Tests, 419 W. 119th Street, New York,
N. Y., 1947-1952.
7 California Test Bureau, Los Angeles, 1929.
8 World Book Co., 1933-1937.


Monroe Diagnostic Reading Examination.9 Grades 1-5.
1. alphabet repeating and reading
2. word recognition (reading)
3. letter reversals and inversions
4. recognition of word and letter reversals
5. mirror reading
6. mirror writing
7. number reversals
8. word discrimination (auditory)
9. sounding words by syllables


Van Wagenen-Dvorak Diagnostic Examination of Reading Abilities.10
Grades 4-16.
1. rate of comprehension
2. perception of relationships (analogies)
3. vocabulary in context
4. vocabulary (not in context)
5. general information
6. grasping central thought
7. retention of detail
8. integration of dispersed ideas
9. drawing inferences
10. interpreting contents


Inspection of the preceding lists shows that although the several 
instruments differ in some respects in their content, they have much 
in common, with the exception of the Monroe which is concerned 
entirely with the sensory aspects and the mechanics of reading and 
writing, particularly with visual anomalies, and auditory and visual 
deficiencies that interfere with learning and with progress in reading. 
The Monroe test, however, is usually preceded and supplemented by 
results obtained with some of the usual tests of reading achievement. 
Emphasis in the diagnostic tests changes, of course, as the educational 
level becomes higher, the changes depending upon the skills and com- 
petence usually expected in the progressively higher school grades. 


Readiness. Tests of reading readiness serve a still different purpose 
from either of the other two types, already described. Children are 
customarily admitted to the first grade on the basis of age, the assump- 
tion having been that when a child reached a specified age he was 
ready to begin the study of prescribed first grade subjects. Extensive
investigation of individual differences has revealed, however, that an
appreciable percentage of children are not ready to receive formal
instruction in reading at the usually prescribed time due to immaturity

9 C. H. Stoelting Co., 1930.
10 Educational Test Bureau, 1939-1940.


of some specific types of perceptions. Reading readiness tests, there- 
fore, have been devised in an effort to identify children who are not 
yet mature enough to receive and benefit from instruction in this 
fundamental subject. 

The tests described below are representative of the more widely used
instruments that have been found valuable in determining reading
readiness.

Gates Reading Readiness Tests.11 Grade 1.
1. following directions in marking pictures (ability to listen,
understand directions, and remember)
2. word matching (identification of visual word patterns)
3. word perception (selecting one word from among four)
4. rhyming (auditory perception)
5. naming letters and numbers

Harrison-Stroud Reading Readiness Tests.12 Kindergarten and Grade 1.
1. making visual discriminations
2. using context
3. making auditory discriminations
4. using auditory clues in identifying items
5. using symbols

Metropolitan Readiness Tests.13 Kindergarten and Grade 1.
1. word knowledge (selecting pictures that correspond to words)
2. understanding of and response to oral directions (selecting
pictures in response to sentence-long directions, involving sustained
attention)
3. information (same as 2, but more elaborate, involving more
vocabulary and names of common objects)
4. matching (visual perception, involving selection of pairs of
identical pictures of common objects)
5. knowledge of numbers
6. copying (simple geometric forms and less complex numerals and
capital letters)


11 Bureau of Publications, Teachers College, Columbia University, 1939.
12 Houghton Mifflin Co., 1950.
13 World Book Co., 1933-1950.


Monroe Reading Aptitude Tests.14 Ages 5-6 to 8-11.
1. memory of orientation of forms
2. ocular-motor control and attention
3. visual memory
4. motor speed
5. motor steadiness
6. auditory discrimination of words
7. sound blending
8. vocabulary (recognition of pictured objects)
9. auditory memory
10. word articulation
11. speech facility (speed or repetition of a word or phrase)
12. vocabulary by word association (naming animals, foods, toys)
13. sentence length and use of language (telling a story about a
picture)
14. motor test of handedness (writing name with hand)
15. laterality of hands, feet, and eyes


Tests of reading readiness are based upon observation and analysis of
aptitudes and abilities that have been found to be involved in learning
to read by school beginners. Inspection of each of the subdivisions of
the listed tests shows that there is extensive agreement among them
regarding what is to be measured, in spite of somewhat different names
and emphases. Of the four instruments listed, the Monroe is most
detailed analytically: it places much more emphasis upon visual,
auditory, and motor functions than do others. On the whole, in broad
terms, reading readiness is measured in terms of sensory development
and acuity, language development and interest, curiosity about the
environment (information, vocabulary, numbers), and, to a lesser
extent, motor control and speed (since learning to write is collateral
with learning to read).

These and comparable tests have demonstrated their value. The
percentile ratings on the Gates test, for example, indicate whether a
child may be expected to experience much, little, or moderate
difficulty in learning to read. The Metropolitan provides scores and
ratings (A to E) that indicate probabilities of successfully learning
to read. The tests are significantly correlated with later achievement
in reading in the early stages. The Monroe yields, for example, an
exceptionally high coefficient (.75) with reading achievement at the
end of the first grade.
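
What a coefficient of .75 implies for prediction can be sketched with
two standard derived quantities: the proportion of variance in later
reading achievement associated with the readiness score (r squared),
and the size of the error of prediction relative to the spread of the
achievement scores (the square root of 1 minus r squared). The variable
names are illustrative; only the coefficient itself comes from the
text.

```python
import math

r = 0.75  # correlation reported for the Monroe test with first-grade reading

# Proportion of variance in reading achievement associated with the test.
variance_explained = r ** 2

# Standard error of estimate as a fraction of the criterion's SD.
error_ratio = math.sqrt(1.0 - r ** 2)

print(f"variance associated with the predictor: {variance_explained:.2%}")
print(f"error of prediction relative to the SD: {error_ratio:.2f}")
```

Even a coefficient this high leaves a good deal of the variation in
reading achievement unaccounted for, so predictions for an individual
child remain statements of probability.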


" Houghton Mifflin Co., 1935- 




his problems aloud, while the examiner (usually the teacher) records
the pupil’s responses, step by step. Thus the errors, faulty reasoning,
and incorrect methods may be detected.

Much more work remains to be done in this field of objective testing
before it will have reached the level of development of tests in other
areas, such as intelligence, some aptitudes, and reading. Perhaps a
major reason for the lag in the field of arithmetic is the wide
varia[...] general use.


TESTS AT HIGH-SCHOOL AND COLLEGE LEVELS 


At the secondary school level, tests are [...] subjects of study. They
[...] both adequacy of content [...] the Iowa Placement Examinations
[...] are tests, very largely, of information and fundamental skills;
but in some instances, as in English, they measure comprehension and
appreciation. [...] College Entrance Examination Board [...]
achievement tests to be used [...] available in social studies [...]
upon dealing with subject matter learned in high [...] they test to
some extent the student’s ability to apply basic principles in new
situations. Thus, they [...] measures of what has been learned and as
[...] performance with similar course materials in college.

Since 1937, objective tests of educational achievement for college
graduates have been in process of standardization and are b[...] under
the auspices of the Graduate Record Office.17 [...] a comprehensive
series in [...] fields designed to show the level and extent of a
student’s general [...]


© Princeton, N. J. 


16 Bureau of Educational Research and Service, University of Iowa. 
17 Educational Testing Service, 




constructed for the purpose of examining the student’s knowledge, 
ability to solve problems in general fields of study, and ability to exer- 
cise judgment based upon knowledge of academic material. The gen- 
eral tests cover seven subject-matter fields, plus vocabulary and ef- 
fectiveness of expression, these being: mathematics, physics, chem- 
istry, biology, social studies, literature, fine arts. There are, in addi- 
tion, specialized examinations to test students in their subjects of 
major study. The general and specialized tests have two major
purposes: they may be used to evaluate the student’s progress during
his undergraduate career, and they may be used as criteria for
admission to graduate study. For the latter purpose, the specialized
examination is regarded as being of particular value.
TESTS OF APTITUDE IN SPECIFIC ACADEMIC SUBJECTS * 


In the fields of mathematics and foreign languages, at the high-school
level, a few tests have been constructed which are intended to predict
pupils’ achievement in these subjects. They are aptitude tests in the
sense that they use restricted and specialized types of materials in an
effort to predict performance in a very limited field. Tests of
aptitude in specific school subjects are constructed on the
well-recognized principle that a sample of performance in a given
subject, obtained under controlled conditions, provides significant
evidence of prospects of learning that subject. These aptitude tests
are not to be confused with educational achievement tests. The former
are prognosis tests, intended for those who have not studied the
subject; the latter are measures of learning after a period of formal
instruction.

In algebra, several tests are available. They include subtests such as
the following: arithmetic problems; analogies; number series; use of
formulas; brief lessons, of a relatively simple level, dealing with
principles and problems of the kind actually encountered in the first
course in algebra. Thus the subtests include materials some of which
were taught in earlier courses of study, while others are similar to


* Although these are classified as aptitude tests, they are being
included in this chapter because they are intimately associated with
achievement in specific courses of study at the secondary school level.
19 Iowa Algebra Aptitude Test, by H. A. Greene and A. H. Piper, Bureau
of Educational Research and Service, State University of Iowa, Iowa
City; Orleans Algebra Prognosis Test, Revised Edition, Yonkers, N. Y.:
World Book Co.


those found in the study of algebra; and still others are actual learning 
subject-matter. 

There are prognosis tests also in plane geometry, constructed on the
same principles as those in algebra.20 They include parts on: reading
geometry content, algebraic computation, algebraic and arithmetical
reasoning, visualization, and brief lessons in geometry followed by
test items.

Aptitude tests in foreign languages likewise are constructed on the
principle that samplings of actual learning provide the best criterion
for predictive purposes.21 They consist of short lessons dealing with
grammatical aspects, vocabulary exercises, and translations. The tests,
it appears, are devised to predict achievement in the
grammar-translation type of foreign language course.

Evaluation of High School Tests. The authors of aptitude tests in
these subjects, and other investigators [...] range, approximately,
from [...] have been reported on occasion [...] relatively high
coefficients, s[...] languages maintain that [...] grammar-translation
type, [...] and chiefly for the purpose [...] It is maintained, also,
that pupils who [...] benefit from a “functional” course in foreign
language. This i[...] place to evaluate the teach[...] evaluating a
test in this field, it is necessary [...] of study with which the test
is concerned.

The principal criticisms of the a[...]a and geometry are that some do
not show sufficiently high reliability, and most place too much
weight [...] test is to predict potential power or level of
achievement, the speed factor should be eliminated; or available
testing techniques [...] obtain separate

20 Iowa Plane Geometry Aptitude Test, by H. A. Greene and H. W. Bruce,
State University of Iowa, Iowa City; Orleans Geometry Prognosis Test,
Revised Edition, Yonkers, N. Y.: World Book Co.

21 Foreign Language Prognosis Test, by P. M. Symonds, Bureau of
Publications, Teachers College, Columbia University, New York;
Luria-Orleans Modern Language Prognosis Test, Yonkers, N. Y.


speeded and unspeeded scores, even though the correlation between 
speed and power is very high (but not perfect, nor nearly so). 


TESTS OF MORE COMPLEX EDUCATIONAL OBJECTIVES 


For the greatest part, the items and content of the educa- 
tional tests thus far discussed are useful primarily, though not solely, 
for testing recall of information or use of certain relatively fixed 
skills. More recently, however, some authors have devised tests which 
are intended to measure performance and educational outcomes on a 
higher level: tests of evaluation, interpretation, critical thinking. An 
example of this type is Wrightstone’s Test of Critical Thinking in the
Social Studies,22 which includes the following three types: (1)
obtaining facts; (2) drawing conclusions; (3) applying general facts.
The first part requires the student to select certain items from a
group of facts in accordance with specific directions. The correct
answer is given to each question, the task of the pupil being to match
answers and given facts. In this part of the test, the pupil is
required to obtain facts from graphs, tables, maps, indexes of a
textbook type, and to locate information in books, magazines and
newspapers in the ways required in a library. This is, in short, a test
of ability to utilize a variety of kinds of materials in order to
acquire information. Part two, called “Drawing Conclusions from Facts,”
is a reading test which requires the pupil to evaluate a number of
conclusions drawn from given data. All necessary information is
provided, so that recall is not involved. The procedure consists of
reading a paragraph, then matching several given statements with it to
determine which of them is appropriately drawn from the paragraph. Part
three, “Applying General Facts,” is much the same as part two, for the
pupil is required to generalize and apply facts presented in a
paragraph to be read.23

The Wrightstone test is open to criticism regarding its content and 
the degree to which it achieves its stated purpose; but it is a type of 
measure worth noting because it is a departure from the much more
common instruments which seek to measure only information. It


22 By J. W. Wrightstone. Published by Bureau of Publications, Teachers
College, Columbia University, 1939.

23 The directions in the test itself will help clarify this part:
“Below each paragraph are two sets of statements about the paragraph.
In the left-hand column are five statements. Three of these statements
will help you to understand the references in the right-hand column.
Select a statement from the left-hand column which best explains a
reference in the right-hand column.”


an achievement test, educators, guidance counselors, and psycholo- 
gists must closely scrutinize its manual for evidence of sound stand- 
ardization procedures, statistical (score) validity, and the bases of 
content validity. 

Tests of educational achievement are particularly valuable in the 
primary and elementary grades where they are used to measure pu- 
pils’ basic skills chiefly in reading, spelling, and arithmetic, and in 
extent of vocabulary. For these purposes standardized tests, with 
their norms of performance and their diagnostic means, provide teach- 
ers and others with superior instruments for the measurement of 
pupils’ progress in universal fundamentals. 

In other respects and at higher educational levels, standardized tests
have been used to excess in thi[...] circulated by reputable [...] the
standpoint of [...] others, even those [...] such objectives as the
development of concepts and attitudes, critical thinking, analysis and
synthesis of materials, creativity, originality, problem solving and
the like. With rela[...] tests of educational [...] tests (e.g., the
Graduate Record [...], the Wrightstone, [...] Watson-Glaser) present a
number of [...] the facts are given, as are t[...] creative, original,
or spon[...]al thinking is reduced, often [...] ability to discriminate
between [...]. While such tests, if well conceived, are greatly
superior to the simple measurement of information, their limitations
must be recognized.

On the whole those achievement tests which assist the teacher, 
the psychologist, or the school counselor in discerning sources of an 
individual’s educational difficulties will be most useful also for clinical 
and guidance purposes. Most others, above the primary and ele- 
mentary grades, may be useful in obtaining an index of progress in 
school subjects, chiefly with respect to amount of information or 
skills acquired. 


TESTS OF PROFICIENCY 


Tests in this category deal with achievement level in occupa- 
tional areas rather than in subjects of school study, They are, in fact, 
specialized achievement tests. For this reason they are designated dif- 
ferently and separately classified. 

Proficiency tests are available especially in bookkeeping, shorthand, 
typewriting, and other subjects in business education. A battery of 
Proficiency tests may, however, be built from existing instruments 
that were originally devised for another purpose. Such batteries have 
been labelled “custom built.” For example, analysis of a particular 
type of office job may show that proficiency in four areas is neces- 
Sary: namely, arithmetical computation, spelling, perceptual speed, 
and memory span (immediate recall). As a first step, it might be pos- 
sible, in this instance, to utilize in combination four existing tests al- 
Teady standardized. But, of course, specific validation studies for this 
Particular situation would be necessary. On the other hand, the nature 
Of the job might be such as to require the development of tests 
adapted to this specific situation. _ 

Another type of proficiency test is the work sample. This is nothing 
More nor less than testing the candidate by having him actually do a 
Piece of the work required by the job to be filled. Examples of this 
Would be a replica of a telephone switchboard, an airplane pilot 
trainer (on the ground), a machine, typewriting and shorthand. To 

ave any merit beyond a randomly selected task chosen by each em- 
Ployer, these work sample tests must be carefully selected so as to be 
Tepresentative of the job, and scoring must be placed on a basis that 
1S specifically defined and minimizes subjectivity. 

A third type is the test of analogous functions. Instead of being asked
to produce a sample, the candidate is subjected to tests which measure


Tests of Educational Achievement 
402 


functions that are analogous to those that are necessary in the
particular [...]. For example, if manual speed and precision are required,
speed [...], peg boards, and stylus tracing might be used as a test.
Generally, the available proficiency tests have not been standardized
in the same sense as tests of intelligence and some of [...]
have been. The reasons for this state of affairs are, presumably, that
many proficiency tests are devised for limited and specific purposes
only, jobs even under the same name (e.g., stenography) vary
considerably as to requirements, and it is difficult to obtain large enough
and representative enough population samples. Yet, it should be
possible to develop these devices to a higher level of objectivity and
validity. A few of the tests in this category have been so developed.
One of these is the Seashore-Bennett Stenographic Proficiency
Test,26 which is to be used for selecting, training, and up-grading
employees. This is a work sample test, consisting of five letters that are
dictated at three different rates. The testee then transcribes the letters
in a form that might be mailed out. The final product is scored for:
(1) neatness and cleanness of typing; (2) arrangement of the letter;
(3) quality of stroke; (4) typing errors and erasures; (5) errors in
English; (6) changes from the original in wording and meaning. The
letters to be dictated and transcribed are on records, thus keeping the
quality and rate of dictation constant for all testees. Samples of
superior and poor transcriptions are provided as


26 Psychological Corporation, 1946.




INTELLIGENCE TESTS AS 
CLINICAL INSTRUMENTS 


CLINICAL psychology involves the intensive psychological study
of individuals in which testing, observation, interview, and history
taking are utilized, in whole or in part. The purpose of such study,
in some instances, is to determine the causes of each individual’s
malfunctioning and to prescribe suitable educational and psychological
measures to deal with the problem. These measures may include
educational changes and adaptations, manipulation of the person’s
environment (in one or more of several ways), vocational guidance, or
psychotherapy.

Not all cases coming to a psychological clinic, however, are
problems of maladjustment. They may be individuals who want whatever
objective and psychologically valid information might be obtainable
regarding their general and specific abilities, their interests, and
sometimes, their personality traits.

All types of tests have become indispensable instruments in the
broad practice of clinical psychology. Every test can be clinical in a
literal sense, since it helps to analyze an individual’s abilities, to
obtain a more nearly complete description of his strength and weakness.
In this chapter we shall deal, however, only with two of the
most widely used tests of intelligence, the Stanford-Binet and the
Bellevue, and with several other instruments devised particularly
for the determination of mental abnormality or deterioration. In a
later chapter we shall present some tests of personality and their
clinical uses. While tests of specific aptitudes and educational
achievement are often essential in the study of some individuals, these do not
present the clinical problems and possibilities found in intelligence
and personality tests; the clinical uses of aptitude and achievement
tests, especially in cases of educational and vocational counseling, are
much more obvious and direct. Nor shall we deal further with tests
for infants and preschool children. These are not used in so wide a
variety of clinical problems as are the Stanford-Binet and the
Bellevue; their organization, furthermore, is such that they readily lend
themselves to analysis of performance in terms of sensory, motor,
perceptual, language, and social development.


FACTORS WHICH AFFECT TEST PERFORMANCE 


Before discussing any specific tests, it is necessary to point
out the factors which can affect an individual’s performance on the
occasion of any psychological examination and of which the clinical
examiner must be aware. These factors, or sources of error, are of
two general kinds: intrinsic and extrinsic. By the former, we mean
those factors which are within the subject himself; by the latter, we
mean those outside the subject. Lack of IQ constancy, in some
instances, must be evaluated in the light of possible disturbing factors,
discussed below.


Intrinsic Factors. Such factors that affect test performance include:
(1) organic difficulties such as defective hearing or vision, disability
or enervation due to malnutrition, localized or generalized infections,
glandular dysfunctions, and acute or chronic illnesses which lower an
individual’s level of performance; (2) emotional factors as evidenced
in lack of interest, lack of seriousness, deliberate deception,
negativism, inhibition due to shyness or lack of confidence, hyperactivity
and restlessness, neuroses and the more severe forms of mental
disturbances; (3) language handicaps; (4) speech defects.


The presence or absence of organic difficulties may be inferred
from the subject’s history or from [...] such as teachers; but their
existence [...] of organic difficulties. The report [...] in such
instances, include a [...] and performance which gave rise to the
suspicion or inference. This description must be taken into
account in the interpretation of the test results.

As for emotional factors, the psychological examiner must be able
to discern their presence in the subject’s behavior, or, where the
nature of the instrument permits, in his actual performance on various
parts of the test itself. This latter aspect will be discussed specifically
in connection with the several tests presented subsequently in this
chapter. Discernment of negativism, deception, shyness, lack of
interest, and lack of seriousness is a form of subtle clinical insight which
can be developed only through experience in the actual testing of a
variety of individuals who manifest these traits in varying degrees,
and of others who do not manifest them at all. Such contrasting
individuals provide the necessary basis of comparative evaluations.


Extrinsic Factors. Factors of this kind affecting test ratings include
the following: (1) accidental factors, such as errors in time limits,
broken pencils, distracting noises; (2) scale errors inherent in the
tests themselves due to imperfect standardization; (3) scoring errors
due to judgment of the examiner or to the marginal character of the
response, that is, a response which is on the borderline between the
acceptable and nonacceptable; (4) skill of the examiner, who must
not only be thoroughly familiar with the instrument he is using, but
who must be able to establish the rapport necessary to elicit the
subject’s best performance. These sources of error signify that the
examiner must be highly qualified and must know the standardization and
limitations of his instrument.

The recognition of factors that can affect test performance is
always essential for sound interpretation of test results. It is especially
necessary that these factors be given due weight when the
psychological examiner is confronted with the problem of making a
diagnosis of mental deficiency. Such a diagnosis should not be made on
the basis of an intelligence test alone; for attempts to use only IQ’s
in diagnosing mental deficiency not only disregard the necessity of
having additional information about the individual, but they also
reveal a failure to recognize the probability of “errors of measurement”
due to intrinsic and extrinsic factors.1


1 Some states, for example, unwisely have written into law the requirement
that a child must have a certain IQ (say, 75) or lower to be admitted into a
special class for mentally retarded children.




Tests provide a situation in which the subject is [...]
observed as well as being scored in accordance with objective
standards. In addition to observation and standardized scoring, analysis of
an individual’s performance on subtests and specific items can often
yield significant information concerning his mental status and mode
of functioning. The following sections on the Stanford-Binet and the
Bellevue scales will be devoted to this aspect of test interpretation.


THE STANFORD-BINET SCALE 


Uses. In schools, guidance centers, and clinics, the problems 
for which intelligence tests have been and are being widely used are 
the following: diagnosis of mental deficiency and mental retardation; 
determination of mental levels of delinquents; diagnosis of intellectual 
disturbance and deterioration; differential diagnosis (that is, mental 
traits and profiles of various clinical groups); evaluation of the in- 
telligence of maladjusted children; evaluation of the intelligence of 
children with special disabilities in learning; determination of mental 
superiority; educational guidance for children and others who are not 
“problem” cases, as that term is ordinarily used; vocational guidance, 
for which the determination of intelligence level is one essential. 


Functional Analysis. [...] sufficient to find only [...] differ
functionally in terms of their com[...] results on two Stanford-[...]
for example, each with [...] basal year, extent of scatter (see below),
and items passed or failed at each of the several levels.

Functional analysis of Stanford-Binet [...] apparent because the
items are grouped [...]



than psychological processes.2 It is possible, however, to draw certain
conclusions, regarding an individual’s strength and weakness, through 
analysis of the test items. For example, levels may be determined in 
the following functions: visual perception of form (three-hole form 
board, copying a square and a diamond); visual imagery (copying a 
bead chain from memory, and paper-cutting test); visual memory 
(reproducing geometric designs); thinking (absurd sentences and 
arithmetical problems); memory span and attention (recall of digits 
and of meaningful sentences); ability with abstractions (verbal simi- 
larities and differences); reasoning (plan of search, picture absurd- 
ities, and picture interpretation); word knowledge (vocabulary); con- 
cept formation (definitions of abstract words). Analysis of successes 
and failures in these or similar terms will assist the examiner in de- 
termining the psychological components of average or superior men- 
tal-age rating, or the deficiencies responsible for an inferior rating. 
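
The grouping of items by function listed above can be sketched as a simple lookup table. The following is a minimal illustration only: the item labels are abbreviated from the list in the text, the pass/fail record is hypothetical, and tallying passes per function is a simplification of the examiner's judgment of "level" in each function, not a scoring rule of the scale.

```python
# Function-to-item groupings taken from the list in the text (labels abbreviated).
FUNCTIONS = {
    "visual perception of form": ["three-hole form board", "copying a square",
                                  "copying a diamond"],
    "visual imagery": ["bead chain from memory", "paper cutting"],
    "visual memory": ["reproducing geometric designs"],
    "thinking": ["absurd sentences", "arithmetical problems"],
    "memory span and attention": ["recall of digits", "recall of sentences"],
    "ability with abstractions": ["verbal similarities and differences"],
    "reasoning": ["plan of search", "picture absurdities",
                  "picture interpretation"],
    "word knowledge": ["vocabulary"],
    "concept formation": ["definitions of abstract words"],
}


def summarize_by_function(results):
    """Tally passes per function.

    results maps an item label to True (passed) or False (failed).
    Returns {function: (items passed, items attempted)} for every
    function with at least one attempted item.
    """
    summary = {}
    for function, items in FUNCTIONS.items():
        attempted = [item for item in items if item in results]
        if attempted:
            passed = sum(1 for item in attempted if results[item])
            summary[function] = (passed, len(attempted))
    return summary


# Hypothetical record, invented for illustration; not a real case.
record = {
    "copying a square": True,
    "copying a diamond": False,
    "recall of digits": False,
    "recall of sentences": False,
    "vocabulary": True,
}

summary = summarize_by_function(record)
for function, (passed, attempted) in summary.items():
    print(f"{function}: {passed} of {attempted} passed")
```

A tally of this kind only organizes the record; the interpretation of strengths and weaknesses remains a matter of clinical judgment.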
Analysis also is valuable in detecting the causes of specific
learning disabilities, as in reading. For instance, the test items may reveal
deficient perception and recall of visual patterns, defective visual
recognition, poor copying and reproduction of form, and short memory
span, all of which are causes of confusion and difficulty in learning
to read and to acquire other school subjects. (See Figures 15.1, 15.2,
and 15.3.) When such defects are revealed by the Stanford-Binet
Scale, a further intensive examination with a special reading test is
indicated.

As illustrations of evaluations of Stanford-Binet performance, we
cite the following reports:


Range of Allen’s performance extends from the eight-year level to
his CA level (10 years). His comprehension and reasoning with
abstractions are at the level of his chronological age. Relative to his
average level, S is generally superior in dealing with verbal
abstractions. Comprehension of problems, analysis of absurdities, and
problems of logical fact rank relatively high in his performance. He is
weak in dealing with number concepts; cannot make change
accurately. Visual perception and visual memory are inferior to
chronological age expectancy. Memory for verbal materials is somewhat
inferior to CA level. Vocabulary is inferior; S failed to pass at
ten-year level; failed rhymes; word association is poor; reads with
difficulty. (IQ about 90.)


2 Compare, for example, with the Bellevue scales and the Chicago Tests of
Primary Mental Abilities.




FIG. 15.1 (above, left). Copy of a diamond drawn by a boy with
impaired visual-motor functioning. The same boy drew Fig. 15.2; CA,
9-5; Stanford-Binet MA, 8-0; IQ, 84. Compare this reproduction with
Fig. 15.3.

FIG. 15.2 (above, right). Drawing of the human figure by a boy with
impaired visual-motor functioning. CA, 9-5; Stanford-Binet MA, 8-0;
IQ, 84. This drawing, according to the norms of the Goodenough scale
(Measurement of Intelligence by Drawings), gives this boy a mental
age rating of 6-3, and an IQ of 66. See also Fig. 15.1.


FIG. 15.3 (left). Copy of a diamond by a mentally deficient boy. CA,
13-2; Stanford-Binet MA, 8-4; IQ, 63. Compare with Fig. 15.1.




Lois’ test performance was not consistent. She did well on visual 
items requiring form discrimination and identification, but com- 
pletely failed aesthetic comparison. She was unable to reproduce a 
square correctly, though hers was a close approximation; and she 
completely failed to grasp the concept of man-completion. However, 
she was able to copy folding of paper into a triangle and she did pass 
maze tracing. A consistent failure was that of memory—she could 
not remember either digit spans or sentences, though her speech 
difficulty may have been a major influence in the latter, especially. 
She seemed unable to grasp concepts as a whole; she often misunder- 
stood directions and gave the impression that she occasionally made 
correct responses accidentally rather than as a result of understand- 
ing. She was self-critical on the drawing items, which she did pains- 
takingly and with self-appraisal and subsequent erasures. She also 
was critical of her paper triangle and took another piece of paper 
after saying, “that’s wrong” of her first attempt. She did not seem to 
realize her failures in other respects. (IQ about 55.) 


Nancy was a very cooperative subject. She was a little
apprehensive when she entered the testing room, but quickly relaxed and
entered actively into the tasks. She had a long attention span and
was not distracted by outside noises. She thought carefully before
answering difficult questions, and was self-critical, for she sometimes
thought aloud and corrected herself as she thought. She was rather
self-confident and failure of items, some of which she recognized
herself, did not seem to bother her. She laughed occasionally but on the
whole was serious most of the time, giving all her attention to the
test. (IQ about 125.)


Terman3 offers a psychological rationale which is helpful in
interpreting results obtained with the Stanford-Binet scale, such as
reported above. He gives, for example, the following: ball and field,
practical judgment; rhymes, verbal associations under the direction
of a guiding idea which inhibits irrelevant associations; absurdities,
judgment and critical ability; memory for design, analysis of visual
perception and creation of a whole which facilitates recall;
arithmetical reasoning, ready and accurate application of knowledge to a given
problem.


Scatter Analysis. A large number of investigations have been
conducted in an effort to determine the extent of deterioration, or deficit,
in intelligence suffered by each of the several groups diagnosed as
having a certain form of mental disorder: for example, [...]
schizophrenia, manic depressive psychosis, paranoia, psychoneuroses,
and others. One method commonly used has been to compare
averages of mental ages and the measures of variation found for a
sample of abnormal persons with corresponding indexes found for a
sample of normal persons. The theory is that inferior status of an
abnormal group, as compared with one that is normal, is a result of the
mental disorder. On the whole, it has been found that groups suffering
from “organic” psychoses (e.g., alcoholism, paresis) show the
greatest loss, or deterioration; those suffering from “functional” psychoses
(e.g., schizophrenia, paranoia) show less; while those in the
psychoneurotic group give no evidence of deficit. The published studies,
however, do not agree as to the average amount of deterioration in
each group. Furthermore, within each category of mental illness there
are very marked individual differences in the degree of deterioration
(as shown by measures of deviation).


3 The Measurement of Intelligence, Boston: Houghton Mifflin, 1916.


Due to these variations and to the overlapping of group
distributions of ratings, the investigations of group deficits are of relatively
little clinical value for diagnostic purposes. The investigations [...]
valuable insofar as they have shown
that some degree of deterioration of mental functioning accompanies
most forms of mental disorder.4 Since clinical practice, however, is a
[...] individuals rather than groups, it is necessary to
[...] individual rather than group losses of
mental ability, or peculiarities of performance of individuals
classified in each of the several categories.

One of these methods is [...] distribution of successes [...]
in the Stanford-Binet scale. The[...] scatter: (1) the number [...]
year up to the highest level [...] as range of scatter; (2) items
passed above the individual’s mental-age level and [...] mental-age
level. If only one method is to be used, the first of these is
preferable, since it covers the entire range of performance and
expresses the situation in the most direct and simple manner.


4 For a concise summary and bibliography on this subject, see “Psychological
Deficit,” Chapter 32, by J. McV. Hunt and C. N. Cofer, in Personality and the
Behavior Disorders, Vol. 2, edited by Hunt, New York: Ronald Press, 1944.
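
The first measure, range of scatter, can be sketched in a few lines. Because the definition is incomplete in the text as it stands, the sketch below rests on an assumption that should be checked against the manual: range of scatter is taken as the number of year levels from the basal year (here, the highest level at which every item is passed) up to and including the highest level at which any item is passed. The record shown is hypothetical.

```python
def basal_level(record):
    """Highest year level at which every item was passed (assumed definition)."""
    full = [level for level, items in record.items() if items and all(items)]
    return max(full) if full else None


def highest_level_with_pass(record):
    """Highest year level at which at least one item was passed."""
    passed = [level for level, items in record.items() if any(items)]
    return max(passed) if passed else None


def range_of_scatter(record):
    """Count the year levels from the basal level through the highest level
    with a pass, both ends included (an assumption of this sketch)."""
    basal = basal_level(record)
    top = highest_level_with_pass(record)
    if basal is None or top is None:
        return None
    return sum(1 for level in record if basal <= level <= top)


# Hypothetical record: year level -> pass/fail for each of six items.
record = {
    8: [True, True, True, True, True, True],
    9: [True, True, True, False, True, True],
    10: [True, False, False, True, False, False],
    11: [False, False, True, False, False, False],
    12: [False, False, False, False, False, False],
}

print(basal_level(record))        # 8
print(range_of_scatter(record))   # 4 (levels 8, 9, 10, 11)
```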

Many investigators have reported that psychotic persons show 
greater scatter than do normal individuals or the nonpsychotic men- 
tally deficient. Many reports state also that excessive scatter is a use- 
ful diagnostic sign, particularly in the case of “organic” psychoses 
which apparently yield the greatest degree of scatter. 

In addition to general scatter, analyses have been made of “selective 
scatter”; that is, the effects of specific psychotic conditions upon abil- 
ity to pass specific kinds of test items. For instance, schizophrenics 
are said to be less proficient than normal persons in detecting ab- 
surdities, interpreting fables, passing the purse and field test, memory 
for designs, and passing problem questions. On the other hand, they 
are said to be more proficient in arithmetic reasoning. 

These views on both general and selective scatter are tentative.
There are several reasons for disagreement among the published
findings, one or more of which might operate in any given investigation:
inadequate control groups, inadequate control of mental age and
chronological age of groups being compared, divergent techniques
of scatter analysis, and errors of psychiatric diagnosis.5

In view of inconsistent data, we must conclude that at present, 
numerical measures of general scatter on the Stanford-Binet scales 
are of limited use as clinical aids, so far as most individual cases are 
concerned. On the other hand, extreme degrees of scatter have been 
found diagnostically valuable often enough so that analysis of scatter 
on the Stanford-Binet continues as a clinical technique. Extreme
scatter should make the examiner suspect one or more of the
following: unusual extrinsic factors, emotional or organic disturbance.

Regarding selective scatter, there is a greater degree of agreement;
for the published researches deal mainly with schizophrenics, among
whom there are a number of different syndromes that affect test
performance. Other abnormal groups have also been studied.
Although investigators have not been in complete agreement regarding
the several abnormal groups, all reports do show that schizophrenia
does involve selective impairment of the mental functions tested. It
[...] be expected that the particular functions impaired, and the
de[...] with the particular type of schizophrenic disorder;
[...] as a whole test performance tends to be inferior
[...] more markedly in parts requiring practical reasoning
and judgment (What is the thing to do when . . . ?), in abstract
reasoning (arithmetic problems), and in perceptual organization.
Inferior performance is revealed not only in the lower scores but in the
quality (e.g., bizarreness, irrelevance) of responses and
inconsistencies (failing easy items and [...]


5 For an extensive bibliography on scatter, see D. Rapaport, Diagnostic
Psychological Testing, Chicago: The Year Book Publishers, 1945, pp. 554 ff.


The Stanford-Binet has been used, [...] tests of vocabulary, next
high[...] performance scales.

In the cases of delinquents and mental defectives (nonpsychotic
atypical groups) a quite different order or pattern of
performance was found. Their vocabulary and Stanford-Binet scores
were low, while their performance test ratings were higher. Some
individuals in these two groups, however, had relatively low
performance test ratings; but they, it was found, were least likely to make
satisfactory social adjustment.

It has been reported, also, that among maladjusted children most
of those with relatively very high performance test ability were
individuals referred to clinics as “delinquents”; while children with
significantly lower performance than verbal ratings were mostly those
having personality defects (psychoses, neuroses, emotional
instability).

Published studies are in general a[...] relatively and significantly
inferior [...] most of the behavi[...] most delinquents [...]


6 S. W. Bijou, “The Psychometric Pattern Approach as an Aid to Clinical
Analysis: A Review,” American Journal of Mental Deficiency, Vol. 46,
1942, pp. 354-362; S. W. Bijou and [...], “[...] More Comprehensive
Analysis of Mentally Retarded Pre-Delinquent Boys,”
Journal of Genetic Psychology, Vol. 65, 1944, pp. 147-160.




Why these psychometric patterns should prevail has not yet been
explained satisfactorily. Poor “behavior efficiency” has been suggested
as one explanation of the pattern found with psychotics and habitual
criminals. This notion, however, offers little help unless it is
interpreted to mean a loss in capacity to make adaptations to new or
unfamiliar situations, such as would be encountered in performance
tests. The test pattern of these two atypical groups might be due to
lack of sustained attention and reduction in speed of response, both
of which generally are demanded more on performance than on verbal
scales. Verbal tests, on the other hand, particularly at the adolescent
and adult levels, draw heavily upon information, knowledge,
vocabulary, and habits of thinking, all of which have become rather well
established, if not entirely fixed, and which, therefore, are likely to
become less impaired during mental illness or periods of emotional
instability.

The fact that the psychometric pattern for the mentally deficient
group shows their performance-test ratings to be higher than their
verbal abilities should not occasion surprise, for studies of this group
have repeatedly shown that educationally and psychologically one
of their major deficiencies is in their capacity to deal with language
and number concepts (abstractions in general as compared with the
concrete). Why delinquents should reveal a similar pattern is
uncertain. An explanation may be found in some or all of the
following: delinquents are, as a group, at a lower intellectual level than
an unselected population; they are, as a group, more likely to come
from disadvantaged homes, poorer cultural and educational
backgrounds; they may be lacking in motivation on verbal tests.

There are, finally, some important statistical facts to consider in
connection with interpretation of scatter on the Stanford-Binet. The
first is that the percents of the standardization population passing
each item at any given age level are not all equal. For example, at
age 7 (Form L), the percents passing vary from 51 to 70; at age 8,
from 57 to 67; at age 12, only from 61 to 69. These differences
signify that the items at a given age level are not all of equal difficulty;
hence uniformity of performance is not to be expected. It will be
recalled that placement of items was made on the basis of several
considerations, only one of which was percent passing.
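
The point about unequal difficulty can be checked with simple arithmetic on the figures quoted above. The short sketch below uses only the three percent-passing ranges given in the text for Form L; the variable names are chosen for illustration.

```python
# Lowest and highest percent passing among the items at each age level
# (Form L), as quoted in the text.
PERCENT_PASSING_RANGE = {7: (51, 70), 8: (57, 67), 12: (61, 69)}

# Spread between the easiest and the hardest item at each level, in
# percentage points. A spread of zero would mean equally difficult items.
spreads = {age: high - low for age, (low, high) in PERCENT_PASSING_RANGE.items()}

for age, spread in spreads.items():
    low, high = PERCENT_PASSING_RANGE[age]
    print(f"age {age}: {low} to {high} percent passing, spread {spread} points")
```

At age 7 the easiest and hardest items differ by 19 percentage points, so some unevenness of success within a single year level is to be expected even for a subject with no clinical disturbance.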

Other technical factors that might affect scatter are the reliability
of each item; or low intercorrelations between certain items, some of
which might make demands upon a more or less specialized ability
([...] correlation with g); or items wrongly placed in the age-scale.
Although those technical factors may account for part of the scatter,
they cannot account for all of it, nor for whatever degree of
agreement exists in the findings for a single clinical group.

Although the psychological hypotheses offered to account for some
scatter-patterns are only tentatively advanced, knowledge of the
existence of these psychometric patterns in some cases is clinically
valuable and emphasizes an approach that often has been useful and
that, therefore, continues to be used.


Content Analysis. [...] increasingly interest[...] the
number-of-words-in-one-minute item [...] as clearly revealed in a
sentence completion test and in the Thematic Apperception Test. Other
psychologists report words of violence, ambition, unsatisfied needs,
etc. Verbal responses may thus reveal more or less deeply rooted
emotional conditions.

Word definitions may reveal experiences and centers of interest.
For example: as when “puddle” is defined as “dirty water that you
should not go into”; or “lecture” as “a lot of yak-yak from a
grown-up.” There are also times when definitions will be concentrated in a
special area (e.g., sensory) so as to suggest special interests. Some
give definitions attended by compulsiveness, uncertainty, or
overelaboration.


7 [...]ard, “Intra-Scale [...]ally Defective Children,” The Training School
Bulletin, Vineland, N. J., Vol. 49, 1952, pp. 36-44; A. Magaret and C. W.
Thompson, “Differential Test Responses of Normal, Superior, and Mentally
Defective Subjects,” Journal of Abnormal and Social Psychology, Vol. 45,
1950, pp. 163-167; [...], “Test Patterns of Mental [...] Stanford-Binet Scale,”
American Journal of Mental Deficiency, Vol. 51, 1947, pp. 394-396.

Values and behavior patterns also are revealed at times. For exam- 
ple, under “Comprehension” (Form L, year 7, item 4), one of the 
questions is this: “What is the thing for you to do if another boy hits 
you without meaning to do it?” A common, and expected, answer is: 
“Nothing. He didn’t mean it.” But at times a boy replies: “Hit him 
back” or “Go home and tell my mother (or teacher).” 

Picture interpretations also may offer an opportunity to gain
insights into aspects of behavior and personality. The responses to
pictures may be moralistic, or submissive, or hostile, or anxious, etc. Or
they may be of the ordinary, expected variety.

Some children who give ready and assured responses to routine
materials, like recall of digits and sentences, may be anxious, hesitant,
and tentative when dealing with items that require their own
judgment and evaluation, thus suggesting dependence or submissiveness.

In the test situation it is possible for experienced examiners to
utilize certain practices that have not been prescribed in the manual,
nor even contemplated by the test’s author. As instances in point, the
three following practices are cited. After the formal testing has been
completed, it is sometimes desirable to check on certain failures to
determine whether they were due to lack of capacity, verbal handicap,
or failure to understand instructions.

This may be done by giving actual instruction on certain types of
items, then testing the individual further on the same types. The
results, of course, do not change the subject’s score on the test; but
the procedure and its results should be noted in the report. For
example, if the child has failed on similarities and differences (Form L,
year VIII, item 4), when testing has been concluded the examiner
might return to that item, explain the terms and the nature of the
problem, and even give the correct answer to one of the four parts.
He then asks the subject to give the answers to the remaining parts
to determine if there has been any learning.

A second kind of “extra-testing” procedure is directed toward
discovering how far a testee has been able to proceed on an item he
has failed, as in arithmetical problems. Thus, the examiner returns
to the problem; he leads the testee through the solution step by step,
at times suggesting the procedure or even the basic process to be
used. In this manner, it is often possible to discern the nature and
extent of a subject’s deficiencies.

If a child has failed to copy the diamond satisfactorily (year VII, 
item 3), the examiner may want further information on whether the 
inability is probably due to disturbed visual-motor functioning or to 
defective perceptual analysis. He may, then, make three copies of the 
diamond, of varying quality, one of which is a duplicate of the testee’s 
own copy. The testee is then asked to select the “best” one. It has 
often been found that the child whose poor copy is consistent with 
his general mental level will not select the examiner's superior repro- 
duction or will select at random; but a subject whose general mental 
level is higher than that suggested by his own copy can make the 
correct selection, even though he still will not be able to produce a 
satisfactory copy on subsequent trial. 

These are three illustrations of “extra-testing” procedures that 
enable the examiner to prepare a more analytical and richer report. 


Order of Item Administration. The order in which items of the
Stanford-Binet should be presented has been under examination and
discussion in recent years because clinicians have found that order of
presentation in some instances influences performance of maladjusted
subjects. Three methods have been used: (1) standard, or
consecutive; (2) serial; (3) adaptive. Using the first of these, the examiner
presents the items in the order in which they appear in the scale.
Using the second, he follows through on one type of item, to the
more difficult ones, until the subject fails: for example, memory span
for digits, memory for sentences, similarities and differences,
comprehension, etc. If the third method is used, the examiner starts with
easy items and alternates these with difficult ones of the same type.
When employing the adaptive method, the examiner begins at a level
below the subject’s expected mental age (to insure success) and
moves up and down in an effort to establish the maximal (or terminal)
and the basal levels as early as possible. The principal justification
given for the adaptive method is this: some individuals are
discouraged by a series of consecutive failures when they reach higher and
more difficult levels; they might, therefore, fail items that they could
otherwise pass. The adaptive method is intended to scatter failures
among successes.


The sparse experimental data support the use of the adaptive
procedure. First, it has been found that there is little difference between
results found by standard and adaptive methods when a representative
population or a well-adjusted group is tested. Thus the norms
established by Terman and Merrill through standard testing are applicable
when the adaptive method is used. One study reports a correlation of
.93 between results obtained when the two methods were compared,
using Forms L and M. The test-retest differences between IQ’s were
just about what Terman and Merrill reported in their standardization
data: namely, about 5 points.

Second, with the adaptive method, maladjusted subjects, especially
the more maladjusted, achieve higher IQ ratings than they do by the
standard order of testing. One study reports a mean gain of 11 points
when the adaptive method was used, as compared with the standard
consecutive method.

Insofar as an adapted method of procedure contributes to a fuller
and more valid understanding of a subject’s performance, without
violating the basic principles of the scale being used, its use in
appropriate cases is warranted.

8 A. N. Frandsen et al., “Serial versus Consecutive Order Administration of
the Stanford-Binet Intelligence Scales,” Journal of Consulting Psychology,
Vol. 14, 1950, pp. 316-320; M. L. Hutt, “A Clinical Study of Consecutive
and Adaptive Testing with the Revised Stanford-Binet,” ibid., Vol. 11, 1947,
pp. 93-103.


THE BELLEVUE SCALE

For purposes of analysis, this scale has an obvious advantage
over the Stanford-Binet because its items are grouped into eleven
subtests, each examining psychological functions which might be
susceptible to different forms of personality maladjustment. Arrangement
into subtests not only permits the preparation of analytical
profiles for individuals, but it makes possible the ready analysis of
subscore interrelationships which may be studied for any bearing they
might have on clinical use.

Clinical Diagnosis. Since 1940 a rather large number of studies have
been reported, detailing differences found between normal and abnormal
groups, and between the several abnormal groups themselves.
These researches were concerned with differences in extent of scatter
and with differences between patterns of subtests, emphasis being
placed upon the latter. There seems to be agreement, for example, that
schizophrenics are relatively superior on the subtests of Information
and Comprehension, while they are inferior on Object Assembly and
lowest on Digit-Symbol. One investigator, Rabin, devised a
“Schizophrenic Index”: namely, the ratio of the sum of the scores on
Information, Comprehension, and Block Design to the summated scores
on Digit-Symbol, Object Assembly, and Similarities.9 The justification
for this empirical index was that it seemed to differentiate between
schizophrenics, on the one hand, and manics and normals, on the
other. This index, however, must be regarded as tentative.
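
The arithmetic of the ratio just described can be illustrated with a short sketch in Python. The function name and the sample scores are hypothetical and were invented for this illustration; the text reports no cutoff value for the index, so none is assumed here.

```python
# A minimal sketch of the ratio described in the text. The function name
# and the example scores are hypothetical, chosen only for illustration.

NUMERATOR_SUBTESTS = ("Information", "Comprehension", "Block Design")
DENOMINATOR_SUBTESTS = ("Digit-Symbol", "Object Assembly", "Similarities")


def schizophrenic_index(scores):
    """Return (Information + Comprehension + Block Design) divided by
    (Digit-Symbol + Object Assembly + Similarities)."""
    numerator = sum(scores[name] for name in NUMERATOR_SUBTESTS)
    denominator = sum(scores[name] for name in DENOMINATOR_SUBTESTS)
    if denominator == 0:
        raise ValueError("denominator subtest scores sum to zero")
    return numerator / denominator


# Invented subtest scores for one subject (not data from any study).
example_scores = {
    "Information": 12,
    "Comprehension": 11,
    "Block Design": 10,
    "Digit-Symbol": 6,
    "Object Assembly": 8,
    "Similarities": 8,
}

print(schizophrenic_index(example_scores))  # 33 / 22
```

A ratio above 1 means only that the first three subtests together exceed the last three; whether any particular value is clinically meaningful is exactly the question on which the index remains tentative.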

The foregoing studies were devoted principally to the determination
of group trends and group patterns. Results obtained from study to
study were not always in agreement; nor were the results of a
particular report always unequivocal. We mention these investigations
because they are efforts to make the Bellevue scale more valuable
analytically and clinically; and, as tests are improved, it is possible
that these methods may yet prove valuable as analytical techniques.

Brown, Rapaport et al. first suggested the intraindividual measure
of scatter. This consisted of: (1) calculating, for each individual,
the deviation of each subtest score from the mean of all his subtest
scores; (2) computing the deviation, for each individual, of each
verbal subtest score from his verbal subtest mean; (3) computing the
deviation of each performance subtest score from the performance
subtest mean. This approach proved to be promising and served as
the preliminary study to an extensive and detailed investigation to be
reported below.

The author of the Bellevue scale took note of the analytical and
clinical investigations reported by other psychologists, added his own
clinical findings, and prepared a table giving the test characteristics,
or patterns, of several clinical groups: organic brain disease,
schizophrenics, neurotics, adolescent psychopaths, and mental
defectiveness.11 As an example, according to this table, the pattern of the
neurotic group shows relatively superior scores on Information,
Comprehension, Similarities, and Vocabulary; relatively low scores on
Digit Span, Picture Arrangement, Object Assembly, and Digit-Symbol;
average scores on Arithmetic, Picture Completion, and Block
Design. The explanation of this and other patterns is, presumably,
that mental defect, each major type of personality disorder, and
mental illness are associated with characteristic deficits or losses of
psychological functioning. Mental defectives are individuals whose
mental development has been arrested at a relatively low level, though
not uniformly in respect to all functions. A personality disorder or a
mental illness produces impairment in the individual’s mental abilities;
but the resultant losses are not uniform; apparently some functions
are more adversely affected than others. Thus, if a psychological test
is able to measure and portray these differential deficits and losses, it
can assist significantly in making a clinical diagnosis.

9 A. I. Rabin, “Test-Score Patterns in Schizophrenia and Non-Psychotic
States,” Journal of Psychology, Vol. 12, 1941, pp. 91-100.

10 See D. Rapaport, op. cit., p. 553.

11 See D. Wechsler, The Measurement of Adult Intelligence, 1944, Chapter
11.

We must, however, emphasize a caution in this connection. The
test pattern of each group is not entirely unique; that is, the same
types of tests are generally failed by more than one clinical group,
and still other types are generally high for more than one group. For
instance, psychopaths and mental defectives score relatively better
on nonverbal than on verbal subtests; in general those suffering from
organic brain disturbances, schizophrenia, or neuroses score relatively
low on Digit-Symbol and Arithmetic. There are also some differences
between these clinical groups, as well as similarities. The tests cannot
be used as a short-cut to clinical diagnosis; but for one who is familiar
with the general character of the psychological functions being tested
and which have been impaired, analysis of an individual’s performance
on the Bellevue scale, taken together with other evidence, adds to the
probability of a correct diagnosis of the case.
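
The three-part intraindividual measure of scatter described above (deviation from the mean of all subtests, from the verbal mean, and from the performance mean) may be sketched in Python as follows. The assignment of subtests to the verbal and performance groups follows the usual arrangement of the Bellevue scale and is stated here as an assumption; the function name and the scores are hypothetical and serve only to show the computation.

```python
# A minimal sketch, assuming the usual grouping of the eleven Bellevue
# subtests into six verbal and five performance subtests. The scores
# below are invented for illustration only.

VERBAL = ("Information", "Comprehension", "Digit Span",
          "Arithmetic", "Similarities", "Vocabulary")
PERFORMANCE = ("Picture Arrangement", "Picture Completion",
               "Block Design", "Object Assembly", "Digit-Symbol")


def deviations_from_mean(scores, names):
    """Deviation of each named subtest score from the mean of that group."""
    group_mean = sum(scores[name] for name in names) / len(names)
    return {name: scores[name] - group_mean for name in names}


def intraindividual_scatter(scores):
    """Return the three sets of deviations for one individual."""
    return {
        "all": deviations_from_mean(scores, VERBAL + PERFORMANCE),
        "verbal": deviations_from_mean(scores, VERBAL),
        "performance": deviations_from_mean(scores, PERFORMANCE),
    }


# Hypothetical subtest scores for a single subject.
subject = {
    "Information": 12, "Comprehension": 13, "Digit Span": 8,
    "Arithmetic": 10, "Similarities": 12, "Vocabulary": 11,
    "Picture Arrangement": 7, "Picture Completion": 8,
    "Block Design": 8, "Object Assembly": 6, "Digit-Symbol": 6,
}

scatter = intraindividual_scatter(subject)
for name, deviation in scatter["verbal"].items():
    print(f"{name}: {deviation:+.1f}")
```

Each deviation is taken from the individual’s own mean, not from any group norm, so a negative value indicates only that the subtest falls below that person’s own general level.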


Scatter Analysis. The most detailed report on the Bellevue scale as a
diagnostic instrument was presented by Rapaport.13 His study is
based upon the examination of 217 clinical cases and 54 control cases.
The control group consisted of “54 randomly chosen members of the
Kansas Highway Patrol.” This control group is the weakness of
Rapaport’s research; for it is doubtful if a small number of Kansas Highway
Patrolmen constitute an adequate control against which test patterns
of patients should be compared. Furthermore, in order to evaluate the
test patterns of the members of the control group it was necessary to
establish an adjustment rating for each one. If an individual was
well-contained, a satisfactory worker, without evidence of maladjustment
beyond the ordinary occasional difficulties, anxieties, or mood
fluctuations, he received a rating of 1. Those whose behavior was considered
borderline or maladjusted (in respect to instability, childhood
symptoms, impulsiveness, moodiness, etc.) got a rating of 2. Patrolmen
who were clearly maladjusted were rated 3. These ratings were based
upon two hours of interview and social, developmental, and
occupational records.

12 The reader should note that in a particular instance each subtest score is
high, low, or average with reference to an individual’s own average score, not
with reference to average performance of any group.

13 Diagnostic Psychological Testing, Vol. 1.

We may question not only whether the group itself is adequate
for control purposes, but also whether the evaluations and ratings of
the patrolmen’s personalities rested upon a firm enough basis.
Nevertheless, Rapaport’s report on the Bellevue scale is the most exhaustive
and resourceful one yet to appear. It indicates the clinical possibilities
of an intelligence test; it attempts to provide the psychological
rationale of each subtest, on the basis of which the nature of mental
deterioration or deficit may be more analytically and precisely represented.
We shall, therefore, briefly present Rapaport’s method, with several
illustrations for clarification. The method employed was scatter
analysis, essentially in two forms.

(1) Vocabulary scatter. This is the difference between a person’s
score on a particular subtest and his score on the Vocabulary subtest.
The reason for using Vocabulary as the base of comparison is that
rather consistently it has been found to be the psychological test least
vulnerable to impairment by personality maladjustment or mental
disturbance. It is the Vocabulary score, therefore, from which the
individual’s original, unimpaired intelligence level can best be inferred.
Degree of loss in other functions can thus be derived from the
differences between ratings on each of the subtests and the rating on
Vocabulary.

(2) Mean scatter. This is the difference between the rating on any
single subtest and the average of the ratings on all the remaining
subtests.14 The mean scatter, which may be positive or negative for a
given subtest, measures the relationship of a single measured function
to the average of all other functions measured by the test. It is thereby
possible to find out whether an individual’s level of achievement and
functioning on any of the subtests has deteriorated more or less than
on the remaining ones.

14 In calculating mean scatter, Rapaport omitted Digit Span and Arithmetic.
This omission, he states, “. . . fact that impairments [of . . .] clinical and
control groups, that their inclusion would have vitiated . . . of the mean as
a . . .”

Mean scatter can be calculated for the entire test, including both
verbal and nonverbal (performance) subtests, or it can be found
separately for the six verbal scores and the five nonverbal scores.15
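
The two forms of scatter just defined can be sketched in Python as follows. This is an illustration under stated assumptions and not Rapaport’s own procedure: the function names and scores are hypothetical, the verbal grouping is the usual one for the Bellevue scale, and the sketch does not apply the omission of Digit Span and Arithmetic mentioned in the footnote on mean scatter.

```python
# A minimal sketch of vocabulary scatter and mean scatter as defined in
# the text. Names and scores are hypothetical; the omission of Digit Span
# and Arithmetic noted in the footnote is NOT applied here.

VERBAL = ("Information", "Comprehension", "Digit Span",
          "Arithmetic", "Similarities", "Vocabulary")


def vocabulary_scatter(scores):
    """Each subtest score minus the Vocabulary score."""
    base = scores["Vocabulary"]
    return {name: score - base
            for name, score in scores.items() if name != "Vocabulary"}


def mean_scatter(scores, names=None):
    """Each subtest score minus the mean of the remaining subtests.

    `names` restricts the computation to one group (for example, the
    verbal subtests); by default all subtests in `scores` are used.
    """
    names = list(names) if names is not None else list(scores)
    if len(names) < 2:
        raise ValueError("mean scatter needs at least two subtests")
    result = {}
    for name in names:
        others = [scores[other] for other in names if other != name]
        result[name] = scores[name] - sum(others) / len(others)
    return result


# Invented subtest scores for one subject.
subject_scores = {
    "Information": 12, "Comprehension": 13, "Digit Span": 8,
    "Arithmetic": 10, "Similarities": 12, "Vocabulary": 11,
    "Picture Arrangement": 7, "Picture Completion": 8,
    "Block Design": 8, "Object Assembly": 6, "Digit-Symbol": 6,
}

print(vocabulary_scatter(subject_scores)["Digit Span"])      # 8 - 11
print(mean_scatter(subject_scores)["Vocabulary"])            # entire test
print(mean_scatter(subject_scores, VERBAL)["Digit Span"])    # verbal only
```

Passing a list of subtest names to the second function gives the separate verbal or nonverbal computation; omitting it gives mean scatter for the entire test.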

A third procedure, one which supplements scatter analyses and
is regarded as essential, is the comparison of pairs of subtest scores.
Any two subtest scores may be compared for the purpose of finding
out how the subject’s functioning in one compares with his functioning
in another. For example, how does his retention of information
(remote learning) compare with his memory span for digits (immediate
recall and attention)? How does his arithmetic score (reasoning and
habitual responses) compare with his score on similarities (concept
formation)?

Thus, it is possible to obtain in considerable detail an analysis of
the subject’s achievement levels on the several parts of the scale, their
interrelationships, and evaluations of general and special mental
impairments. Scatter on the Bellevue scale, it is maintained, “is not
random, but follows definite rules and is diagnostically differential
between kinds of clinical and normal groups.” 16

The profile in Figure 15.4 illustrates the method we have been dis- 
cussing. We quote, also, from the interpretation of this case not only 
to clarify the profile but to emphasize the fact that mechanical scatter 
analysis is inadequate and to show how the psychological rationale 
of the scale may be utilized in interpretation. 

[Figure 15.4: scatter profile of a case labeled “Neurotic Depressive,”
plotted for the subtests Comprehension, Information, Digit Span,
Arithmetic, Similarities, Vocabulary, Picture Arrangement, Picture
Completion, Block Design, Object Assembly, and Digit Symbol.]

FIG. 15.4. “Outstanding in the scatter is the great discrepancy between
the verbal and the performance sub-test scores, and this discrepancy we
have found to be a statistically significant, and therefore diagnostic,
indication of depression. The rationale of this finding is the following:
depression becomes manifest in intellectual functioning by a retardation
of perceptual and associative processes. The relatively complex visual
organization and visual-motor coordination required by the performance
subtests put too great demands upon the slowed-down depressive.
Furthermore, in contrast to the untimed verbal subtests, the performance
subtests have time limits on each item and even give extra credit for
speed. Consequently, depressives not only do not obtain extra credit, but
exceed the time limit on many items. For this case, item analysis
confirms the retardation by showing that, on Picture Completion and Block
Design, a number of items were failed only because they exceeded the
time limit. . . . [t]he impairment of Digit Span or attention is also
striking and reflects the presence of intense anxiety accompanying the
depression. The mild impairment of Arithmetic is referable to an
inability to meet the time-limits and gain time-bonuses on the items of
this subtest. In verbalization, much self-depreciation, as well as indirect
criticism of the test and the examiner, are evident.” R. Schafer, op. cit.
(By permission.)

It is true, unfortunately, that not all clinical cases present clear-cut
patterns of test performance upon which a differential diagnosis may
be made. In an appreciable number of cases, scatter and subtest
analyses have been found definitely diagnostic; in other cases, the
analyses, though not conclusive, provide indications of the probable
diagnosis; in still others, the scatter profile and analyses are quite
inconclusive. In spite of the fact that Bellevue test-analysis does not
provide a definitive diagnosis in all cases, such analysis is warranted
because it is practically as efficient as other diagnostic procedures.
Intelligence test analysis should never be used alone or independently;
it should be used and interpreted in conjunction with the results
obtained with other psychological tests (e.g., projective methods) and
other clinical and developmental evidence.

15 Rapaport describes three methods of scatter analysis: verbal, mean, and
modified mean. The last two are so much alike that they have been combined
to show the principle involved.

16 Rapaport, op. cit., p. 54

17 From R. Schafer, “The Expression of Personality and Maladjustment in
Intelligence Test Results,” Annals of the New York Academy of Sciences, Vol.
46, 1946, pp. 609-623. This report is based upon studies directed by Rapaport.

One specific precaution must be emphasized in connection with
inferences to be drawn from scatter analysis of the Bellevue, the
Stanford-Binet, or any other intelligence test; namely, a subject may
manifest certain deficiencies on a test due to educational, cultural,
or other environmental deprivation. On the other hand, environmental
predilections or pressures may enhance certain forms of test
achievement. In either instance, the diagnostically distinguishing features of
the intelligence scales will be affected and probably invalidated. Thus,
before drawing diagnostic inferences from intelligence test results, the
psychological examiner must know whether, in the case under
consideration, there have been educational, cultural, or other
environmental factors that might account for any of the ratings and seemingly
diagnostic criteria obtained.


Qualitative Aspects of Responses. Scatter analysis and subtest
comparisons should be accompanied by a qualitative analysis of the
subject’s verbalizations and general approach to the test problems. Such
an analysis reveals characteristics of the individual’s mentality in
operation: excessive doubt, indecision, self-criticism, impulsiveness,
bizarre notions, obsessions, random guessing, etc. Certain of these
and other qualitative characteristics have been found frequently
associated with clinical groups.

In view of the lack of uniformity of subtest patterns for the several 
clinical classifications, a testee’s mode of approach to a test item is 
highly significant, psychologically, in the interpretation of behavior 
and in the understanding of his personality. Classification and label- 
ling are less important, in an individual case, than description and 
analysis of that person’s behavior and functioning, as shown by sub- 
test results. 

A person’s mode of approach will affect his performance differently
on each of the various types of subtests, depending on whether the
subtest makes demands, for example, on habituated responses or on
responses requiring flexibility and reorganization. Individuals with
brain damage, as a case in point, when tested show rigidity, inability
to shift attention, inability to change their mode of responding,
inability to ignore superficial or extraneous stimuli, and difficulty in
organizing material into a pattern or a meaningful logical sequence.18
Inspection of the Bellevue subtests will show that performance on
them would be variously affected by behavioral traits such as these.

Failure to respond correctly to a test item is not the only matter of
importance to the clinical psychologist. While correct and acceptable
responses to test items are fixed by research and by agreement of
psychologists regarding the interpretation of the findings, the
incorrect responses are not fixed; nor is the manner of responding fixed.
To illustrate: current tests have sets of items testing “comprehension,”
“similarities,” and “arithmetical reasoning.” The first are of the
“when” or the “why” type; that is, “what is the thing to do when
. . . ?” or “why is it desirable (or necessary) to . . . ?” The
second, “similarities,” requires understanding of basic likenesses; for
example, “In what way are ____ and ____ alike?” The arithmetical
problems range from the simplest to fairly complex. A person’s
response to any of these may be right or wrong; but his speed and
confidence in answering, or his anxiety or blandness about incorrect
responses, often reveal significant personality traits. When, in a test
of information, an otherwise intelligent person blandly replies that
Tokyo is in Turkey, or, in a test of similarities, that a dog and a lion
are alike because both have digestive organs, and gives other bizarre
responses, personality disturbance is strongly indicated.

Personality disturbance of the obsessive kind, for example, must
be considered when a subject feels compelled to offer four or five
explanations of courses of action in reply to a “comprehension” item;
or when he mentions three, four, or more likenesses on some of the
“similarities” items; or when he gives elaborate and often quibbling
definitions of words in the vocabulary test. Or if the person persists
in guessing blindly on test-items that are clearly beyond his level of
ability, his test-behavior may be indicative of an uncritical
impulsiveness.

18 Cf. R. Goldman et al., “Use of the Bellevue-Wechsler Scale in Clinical
Psychiatry with Particular Reference to Cases with Brain Damage,” Journal
of Nervous and Mental Diseases, Vol. 104, 1946, pp. 144-179.


The foregoing are illustrations of the role of language with respect 
to certain types of items. Some personality patterns or categories may, 
however, be inferred from the manner in which the person deals with 
the test in general and as a whole. A few illustrations follow. 

In the “obsessive-compulsive” individual, verbalization of responses
is over-detailed and doubt-laden. For example, in response to the 
question: “What does ‘stanza’ mean?” the ordinary person who has 
the information would probably say, “A group of rhymed lines,” or 
something similar. A characteristic obsessive-compulsive response, 
however, would be comparable to this: “A stanza of rhymed lines 
forming one of a series of similar divisions in a poem. Two rhymed 
lines form a couplet. A four-lined stanza is a quatrain, a six-lined one 
is a sestet,” etc. One or a few such replies do not warrant the charac- 
terization of the respondent as obsessive-compulsive; but when this 
kind of answer is “idiomatic” of the individual, such characterization 
is indicated. 

Persons in an “anxiety state” also frequently give responses that
are typical of that group. In general, their behavior is characterized
by restlessness, apprehensiveness, impaired attention and
concentration, and bodily expressions (such as tics, nail-biting, fidgeting,
coughing, etc.). This psychological state is manifested, in the test-situation
where language is required, through difficulty with finding words,
impulsively blurting out unfinished or unchecked or inappropriate
replies, or fumbling about for adequate formulations. For example,
when the question is: “How many weeks are there in a year?” the
anxiety-ridden person may reply: “There are 48 weeks in a year;
no . . . let’s see . . . or is it 50? . . . wait a minute . . . let’s
see . . . 12 months . . . 4 weeks in a month . . . yes, that’s right,
48.” Or consider the question: “Why should we keep away from bad
company?” The person in an anxiety state may give a reply of the
following sort. “If I was in bad company, I’d get away quick. I
wouldn’t want to be with them in the first place. They . . . er . . .
are . . . ah . . . and . . . er . . . ah . . . get you in trouble. . . . I
don’t think a person should be in bad company; that is, . . . er,
ah . . . if he was brought up right. Anyhow, I’d leave them!”


The test responses of psychotic persons, also, are quite significant
for diagnostic purposes; but the subject of psychotic responses is a
complex one. One aspect of the responses of psychotics may be
indicated here: namely, disorganization of thinking and bizarreness of
responses, typical of schizophrenics, both, of course, being indicated
in the language of their answers. For example, the former teacher of
history who cannot correctly give Washington’s birthday, or the
clergyman who cannot define the word “vesper” are instances that
indicate disorganization of memory and loss of previous knowledge.
Or, on the “bad company” question, when one gives an emotionally
intense and moralistic response, and explains, irrelevantly, why
people should be “good.” Such a response suggests serious impairment
of judgment. Or in the vocabulary test, when a subject gives impulsive
replies, such as defining “belfry” as “a kind of bellboy”; or “repose”
as “to pose over something.” These bizarre answers indicate an
impulsive “clang-association.”


Deterioration Index. The Bellevue scale presents a procedure for
calculating the approximate amount of mental deterioration due to
advancing age.19 This index rests upon the principle that some of the
mental functions being tested decline more rapidly than do others. 
The difference between rates of impairment, as shown between the two 
types of functions, is said to indicate, in the case of any particular 
individual, his degree of deterioration. There are certain tested func- 
tions that “hold up” with age whereas others do not. The subtests 
reported as “holding up” with age are: Information, Comprehension, 
Object Assembly, Picture Completion, Vocabulary. Those that do 
not “hold up” with age are: Digit Span, Arithmetic, Digit-Symbol, 
Block Design, Similarities, Picture Arrangement. Since some loss of 
tested abilities is normally expected with advancing age, this factor is 
taken into account in calculating the deterioration index. 

As a first step it is necessary to calculate the average loss due to
deterioration: that is, the total of “hold” scores minus total of “don’t
hold” scores, divided by the total of “hold” scores. The empirical
formula is: 20

$$\text{Deterioration Index} = \frac{\text{Hold} - \text{Don't Hold}}{\text{Hold}}$$

For example, assume that a man 35 years of age has the following
scores: “hold,” 50; “don’t hold,” 35. Substituting these values in the
formula, the deterioration index is 30 percent. Since only a 5 percent
loss is expected at this age (according to Bellevue norms), this man’s
excess or net loss is 25 percent, which is regarded as significant and
as abnormal deterioration.

19 Wechsler, op. cit., Chapter 6.

20 This empirical formula, it is clear, uses the sum of the “hold” scores as
the basis of comparison; the larger the difference between it and the “don’t
hold” scores, the greater the apparent deterioration.

The reader will recall that in the standardization of the Bellevue
scale, the age norms for each subtest allow for declining scores in
these measured functions after the age of their measured maximal
level. By summing the means of the “hold” tests and the means of
the “don’t hold” tests at successive age levels, it is possible to estimate
the average or normative loss on test scores to be expected with
increasing age. The difference between the loss shown by an individual
and the normative loss of his age group may then be calculated to
determine whether that individual’s deterioration index is greater or
less than expected for his age. If a significant discrepancy is found,
then the reasons and significance thereof should be sought.

The principle of the deterioration index is presented here as an
aspect of the Bellevue scale, but not because its soundness has been
unequivocally demonstrated. In fact, published researches on the
index differ: some support the concept; others do not; still others are
equivocal. One of the difficulties, again, seems to be that the
conception of “hold” and “don’t hold” subtests might apply to some age
groups and some clinical groups, but not to others; whereas with most
persons, selective rather than over-all loss might be the case.

The soundness of this method will depend, also, upon high
reliability of the test, and upon the adequacy of standardization at all age
levels. The standardization process for the Bellevue has been
described in detail; and its weaknesses have been indicated. It appears,
therefore, that the deterioration index is useful primarily in those
instances where loss is so marked that imperfect test reliability and
inadequacy of standardization cannot account for the discrepancy.
Wechsler states that the method has proved effective in practice and
may be applied with “reasonable confidence” if the test’s limitations
are given due consideration. Further experimental validation of the
method is necessary.

For the present, comparisons with premorbid test results (when
available) would be preferable as evidence of loss of mental efficiency.
Where such data are not available, the index is of limited value and
should be supplemented by inter-comparisons of subtest scores and
by analysis for internal evidence of loss.
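
The computation of the deterioration index and of the excess over the loss expected for age, as described above, can be sketched in Python. The function names are hypothetical. The values used are those of the example in the text (“hold,” 50; “don’t hold,” 35; expected loss, 5 percent); the expected loss must in practice be taken from the norms and is simply supplied here as a number.

```python
# A minimal sketch of the deterioration index described in the text.
# Function names are hypothetical; the expected loss for an age group is
# passed in as a number, since the normative table is not reproduced here.

def deterioration_index(hold_total, dont_hold_total):
    """(Hold - Don't Hold) / Hold, as a proportion."""
    if hold_total <= 0:
        raise ValueError("total of 'hold' scores must be positive")
    return (hold_total - dont_hold_total) / hold_total


def excess_loss(hold_total, dont_hold_total, expected_loss):
    """Individual's index minus the normative loss for his age group."""
    return deterioration_index(hold_total, dont_hold_total) - expected_loss


# Values from the worked example in the text: a man 35 years of age.
index = deterioration_index(50, 35)
net = excess_loss(50, 35, expected_loss=0.05)

print(f"deterioration index: {index:.0%}")  # 30%
print(f"excess (net) loss:   {net:.0%}")    # 25%
```

Because the index is a simple ratio of two sums, small errors in either sum change it appreciably, which is one reason for the caution urged above where the loss is not marked.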




KENT SERIES OF EMERGENCY SCALES 21


These scales, commonly referred to as the Kent E-G-Y, are 
intended for clinical use when the “examiner is . . . in need of a 
simple and informal mental test that can be presented very briefly as 
a preliminary measure.” Performance upon this scale, it is held, is a 
useful criterion in deciding at what mental level to start the more 
formal and thorough examination. 

The series consists of four overlapping scales, each with its own
norms: Scale A, ages 5-7; Scale B, ages 6-8; Scale C, ages 7-10;
Scale D, ages 9-14+. Each is very brief, requiring only ten to fifteen
minutes. The scales consist only of a series of questions on objects and
matters of common occurrence. The following are examples.


Scale A: 
Which is larger, a cat or a kitten? 
At what time of year is it very cold? 
What do we use our eyes for? 
Scale B: 
What is a key used for? 
What does the cow give us? 
Scale C: 
What does a baby chick come from? 
What makes it light on a cloudy day? 
Scale D:
What are houses made of?
If your shadow points to the northeast, where is the sun?


Evaluation. The original Kent scale was devised for clinical use
principally with delinquents and patients in mental hospitals, as a
preliminary screening instrument. During World War II, it was
modified and shortened for emergency screening. Since then, the newer
scales have been constructed. Published data on the value of these
emergency scales are equivocal. Some report high correlations with the
Stanford-Binet, but, at the same time, appreciable IQ differences.
Others report only low or moderate correlations. Practically all seem to
agree, however, that the scales rate superior subjects lower than does
the Stanford-Binet, while mentally retarded subjects are rated higher.
It has also been found that a large proportion of persons with behavior
disorders are ranked significantly lower on the Kent than on the
Stanford-Binet. This offers a suggestion for possible use of these two
scales with some clinical cases.

Any emergency, quick-screening test necessarily must be relatively
coarse; and as such its results will be much less reliable than those
obtained with longer, more varied, more penetrating tests. The use of
a very brief test is warranted only in an emergency. This is emphasized
by Kent, who intended the scales to be employed only as one unit in a
battery. They are not intended as a complete examination and should
not be used as such.22

21 By Grace H. Kent. Published by The Psychological Corporation, 1946. See,
by Kent, Oral Test for Emergency Use in Clinics, Mental Measurement
Monographs, Serial No. 9, 1932, Williams and Wilkins; F. A. Mullen, “Comparison
of the Revised Kent Emergency Tests with the Revised Stanford-Binet and the
Kuhlmann-Anderson Tests,” Journal of Psychology, Vol. 15, 1943, pp. 151-163;
H. P. Hogan, “Comparison of Stanford-Binet and Kent Oral Emergency Scale,”
Journal of Genetic Psychology, Vol. 58, 1941, pp. 151-159; W. A. Hunt and
I. Stevenson, “Psychological Testing in Military Psychology. I. Intelligence
Testing,” Psychological Review, Vol. 53, 1946, pp. 25-35.


A REPORT OUTLINE 

The following is an outline used in teaching graduate students 
to prepare reports of tests of intelligence administered by them under 
supervision. This report outline will help to clarify the weight and sig- 
nificance given to qualitative aspects of test performance. 


PSYCHOLOGICAL EXAMINATION

Date:
Name: (last, first, middle)
Age: (years and mos., e.g., 8-7)        Date of Birth:
Referred by:

Development of the report should follow this general outline:

I. Introductory statement
   a. By whom tested
   b. Why tested
II. Give a summary statement of the child’s (or adult’s) reaction to
    the test situation, the examiner, etc.
III. State the name of the test given, in full.
IV. Test results indicated by:

    Stanford-Binet                     Wechsler-Bellevue
    a. Mental Age                      a. Verbal IQ
    b. Intelligence Quotient           b. Performance IQ
    c. Basal Age                       c. Full Scale IQ
    d. Terminal Age                    d. Descriptive Classification
    e. Descriptive Classification
       (average, superior, etc.)

V. Test Evaluation
   a. Indicate the performance of the person with regard to:
      (1) verbal material
      (2) non-verbal material
   b. Compare MA (or IQ in W-B) with achievement on Vocabulary
      test score
   c. Point out quantitative and qualitative aspects of test
      (1) point out strengths
      (2) point out weaknesses
   d. State your evaluation of the performance of the person
VI. Summary statement

22 A useful summary of psychological examining in clinical situations will be
found in J. W. Carter, Jr. and J. W. Bowles, Jr., A Manual of Qualitative
Aspects of Psychological Examining, Clinical Psychology Monographs, No. 2,
1948.

The quantitative analysis can be derived from knowledge of which 
psychological functions the various test items are sampling, and by 
comparisons of how well the subject handles the various types of test 
items. 


The qualitative evaluation may be derived from the following: 


1. Reaction time: Were responses delayed, blocked, irregular? 
Was there any indication of negativism? 
Were the responses given quickly or impulsively? 
2. Responses: Are they nonsensical, immature, childlike? 
Are they good; are some better than others? 
Is there confabulation? 
Are there peculiarities of speech? 
3. Depth of responses: Are they popular and surface responses? 
Is the subject profound or does he attempt 
to be profound? (Judge this by compari- 
son with information about subject as 
well as by content of responses. ) 


4. Self references: Is the question (or the answer) referred to the
self?
Are the responses in terms of his own or immediate
experiences or in terms of someone else’s experience?
5. Evidence of confusion or doubt: Do questions have to be repeated?
Does the subject change his answers?
Is the question misunderstood or misinterpreted? If so, in
what way?
6. Verbalization: Is the subject verbose?
Is he spontaneous in responding?
7. Organizational methods: Is he careful or over-meticulous?
Does he plan readily?
Does he generalize readily?
Is there evidence of perseveration?
8. Adaptability: Does he shift rapidly from one test to another?
What is his level of interest?
9. Motor coordination: Are fine and gross movements skillful or
awkward?
Can he execute complex bilateral movements?
10. Effort and concentration: Is he cooperative?
Does he try hard?
Does he have trouble in attending?
11. Mood: Is he easily upset, irritable, emotional, argumentative,
stuporous, elated, happy, sad or depressed?
Does his mood change during the testing situation?


23 I am indebted to my former graduate student and assistant, Dr. Joanna
Byers, who collaborated on the development of this report form.


436 Tests of Mental Impairment 


devised by Hanfmann and Kasanin‘ will be described. All of these 
are based upon the principle that emotional disturbances and personality
disorders interfere with thinking processes, particularly with
ability to form abstract concepts. The purposes of these tests are, 
therefore, to help the psychologist observe the subject’s thought proc- 
esses and to discover the extent to which maladjustment or mental 
illness has impaired his conscious thinking as revealed in efforts to 
solve problems requiring the formation of concepts. 

In particular these tests are intended to evaluate the subject’s ability 
to deal with objects and situations on the abstract or conceptual level 
as compared with the concrete. Ability to form concepts implies con- 
scious reasoning at the abstract level; that is, transcending the imme- 
diate specific sensory situation, abstracting the common property 
from particular instances, analyzing and synthesizing, shifting from 
one aspect to another, keeping in mind several aspects simultaneously,
planning ideationally, and self-criticism. An individual’s behavior at
the concrete level, on the other hand, lacks these characteristics.
The individual is then unreflective; he responds to the immediately
given thing or situation as something unique; he does not perceive an
object or situation as one instance of a general class or category.


[...] This has been devised to determine [...] copy colored designs
with the use [...] thus: one side blue, one red, one [...] and one
white-red, the colors of [...]. The twelve designs to be copied [...]
Kohs series.

[...] test, reproduction of the designs [...] concrete or an abstract
approach. [...] subject merely perceives the model as [...] analysis or
reflection, to copy it. In [...] perceives the design reflectively and
[...] employing deliberate analytical reasoning directed toward a
discovery of the relationships of the parts of the designs (presumably,
the principles underlying its construction). If the subject fails to
reproduce the design correctly on the first attempt, the examiner
presents the same design in a graded series of

E. Hanfmann and J. Kasanin, “Conceptual Thinking in Schizophrenia,”
Nervous and Mental Disease Monographs, No. 67, 1942, p. 115.


Tests of Concept Formation 437 


modified forms, each of which is less difficult to apprehend than the 
preceding form. Each step is intended to facilitate the solution 
through: (1) enlargement to actual block-size; or (2) emphasis upon 
delineation of part relationships; or (3) actual use of block models. 
The subject is credited with having made a solution at the abstract 
level only if he succeeds with an original without the aid of the 
graded series of modified and simplified designs which are aids
of a concrete nature. The authors state, “It would be absolutely
erroneous to suppose that these aids help to initiate a process of
abstraction in an individual who lacks the prerequisites for the act.
These aids simply render that act [process of abstraction]
unnecessary.”

FIG. 16.1. From the Goldstein-Scheerer Cube Test. The subject is
required to construct this pattern from a set of blocks, variously
colored. Psychological Corporation. (By permission.)

Abnormal subjects, it was found, were unable to benefit from the
presentation of graded concrete aids, whereas normal subjects did
benefit. The one group were unable to learn; the other group did learn
to solve the problem at the level of abstraction. For the normal
group, the aids are said to be a means of learning
to succeed at the abstract level on succeeding designs. For abnormal 
persons, the aids are concrete presentations without possibility of 
transfer value to subsequent situations. If the subject being tested can- 
not benefit from the modified and simplified aids, impairment of ab- 
stract behavior is indicated. 

There are no objective criteria for scoring performance on this 
test. Experience in using it with normal and abnormal groups, how- 
ever, is expected to provide the examiner with the qualitative criteria 
on the basis of which impairment or nonimpairment of abstract be- 
havior will be shown. There are, however, degrees of impairment to 
be inferred from degrees of concreteness of behavior. Subjects who
need only the first of the facilitating series (enlargement of design)
are regarded as less concrete in their behavior than those who need the
subsequent steps (delineation; and finally, an actual block-model).

Goldstein and Scheerer, op. cit., p. 55.

This cube test is considered by its authors to be suitable for studying
impairment of abstract behavior in cases of mental deficiency,
brain lesions, and dementia praecox.


The Gelb-Goldstein Color Sorting Test. This is intended to examine 
the subject’s behavior at the abstract or the concrete level on the basis 
of his ability or inability to sort colors according to definite color 
concepts. In one test, woolen skeins of different hue and tint are pre- 
sented at random, there being twelve different shades of each color 
hue. The subject selects one and is then asked to pick out all others 
that go with it. In the second test, three skeins are presented; two are 
of the same hue but different in brightness and saturation; the third is 
of a different hue but the same in brightness as one of the first pair. 
The subject is expected to make a selection either according to hue 
or brightness. The third test involves the presentation of a series of 
six samples of the same color scale, from lightest to darkest red, 
and a second series of different hues but equivalent brightness. The
subject is expected to perceive the common quality within each series. 
In the fourth test, the subject is asked to select all reds or all greens. 
In each of these four tests, the subject is asked to state the reasons for 
his groupings. 

In the abstract approach, the subject treats the colored woolen 
skeins “. . . not as ‘given things,’ but as representative of their basic 
color hue.” That is, the subject operating at the abstract level either 
selects all specimens of a given color without regard to differences in
brightness, intensity, purity, etc., or he selects various colors of the
same brightness, or same intensity, etc. In either case, the subject has
discerned and used a concept based upon a common quality; and in 
so doing he has transcended the immediate sensory aspects of the 
situation. That is, he has used abstraction. In behavior at the concrete 
level, by contrast, the subject does not actually sort on the basis
of a conceptual or categorical relationship; he merely employs a
matching procedure.


The authors have found that abnormal persons with functional behavior
disturbances are incapable of assuming the abstract approach.
In varying degrees, the abnormal subject appears to be unable to shift
from concreteness of matching to the abstractness of categorization.
This color-sorting test is not scored numerically. The examiner, on
the basis of his own interpretation, simply records the subject’s
performance as either concrete or abstract. The authors of the test, of
course, provide specimens of both concrete and abstract behavior.

A statement of evaluation will follow the presentation of the five tests by
Goldstein and collaborators.


The G.G.W.S. Object-Sorting Test. This consists of a group of
about thirty objects common to one’s everyday experiences, there
being one group for males and one for females. Its purpose is “to
determine whether the subject is able to sort a variety of
simultaneously presented objects according to general concepts.”
Ability so to classify is evidence of the abstract approach; inability
is evidence of the concrete approach. Classifications may be made on
the basis of use (e.g., tools), color, form, materials, situational
membership (e.g., implements for setting a dinner table), and pairings.

The test employs several approaches. First, the subject is asked to 
pick out any object he cares to. After doing so, he is directed to select 
all other objects that he believes can be grouped with his initial
selection. The examiner then himself selects some of the articles—one at
a time—and again asks the subject to pick out others that belong with 
each initial item. Second, the subject is given all the disarranged arti- 
cles and is told to place them in groups of his own devising. After the 
subject has made his own classification, the examiner urges the subject 
to find still other kinds of groupings. Third, the examiner himself 
makes several groupings representing still different categories of selec- 
tion (e.g., color, form, toys) and asks the subject to state the concept- 
basis upon which the classifications have been made. In all of the fore- 
going approaches to conceptual classification, the subject is asked to 
explain what he has done and why he has done so, for the purpose of 
ascertaining whether he has consciously proceeded on a conceptual
level of behavior.

By K. Goldstein, A. Gelb, E. Weigl, M. Scheerer.

For example: male’s includes toy hammer, plate, pipe, matches, knife,
fork, apple, screw-driver, nails, candle, sugar; female’s includes pencils,
letter-opener, knife, fork, napkin ring, scissors.

The inclusion of “pairings” as a form of classifying is of very doubtful
validity; for pairing of two identical articles is nothing more than matching.

Performance on this test, like the others in the group, is evaluated
entirely on a qualitative basis, no norms or scale of performance
being provided. The authors do report, however, that children of eight
and nine years have been able satisfactorily to perform these tasks of 
classification. Brain-injured persons and some types of schizophrenics 
could not themselves make the abstract classifications, thereby reveal- 
ing an impairment of mental functioning. Mentally deficient, also, 
were unable to make the required classifications; but their inability in 
this respect is quite within expectation, since very limited and arrested 
ability to deal with the abstract is a characteristic of that group. 
Rapaport and his colleagues have devised a scheme for scoring this
test. Their work indicates that it is possible, by means of scores, at
least to facilitate an over-all survey of the subject’s responses. No total 
score is derived; each of the several approaches (or parts) of the test 
is scored separately. The scoring is based upon both the act of sorting 
and the verbalization; namely (1) adequacy of sortings and verbaliza- 
tions: that is, the degree of deviation from the norm of that item; (2) 
conceptual level of verbalizations: that is, whether the subject’s defi- 
nition is on the abstract, functional, or concrete level; (3) concept 
span: that is, the inclusion of all objects that belong in a well-defined 
category (the selections being neither too restricted nor too inclusive),
utilizing a balance of the inductive and deductive processes. Although
this plan of scoring does not have the degree of objectivity of
standardized tests of intelligence and specific aptitude, it is, nevertheless, a
distinct contribution in that it attempts to formalize evaluation of per- 
formance by selecting and defining criteria. 

The Weigl-Goldstein-Scheerer Color Form Sorting Test. This is 
basically the same as that of the other Goldstein tests. Here the sub- 
ject is required to sort a variety of differently colored geometric fig- 
ures according to color or form. There are four equilateral triangles, 
four squares, and four circles. In each of these forms, one is red, one
yellow, one green, one blue, while the reverse sides are all white.
Conceptual behavior is indicated by the subject’s ability to arrange
the figures on the basis of color or form and to verbalize his act to
confirm its conceptual character. It is sometimes difficult or
impossible to judge from the act itself whether an individual’s sorting
is at the concrete or abstract level. Verbalization of the act is,
therefore, essential.

13 Diagnostic Psychological Testing, Vol. 1, pp. 401 ff. See also F. Boyd,
“A Provisional Quantitative Scoring with Preliminary Norms for the
Goldstein-Scheerer Cube Test,” Journal of Clinical Psychology, Vol. 5,
1949, pp. 148-153.

Concrete perception and behavior are generally characterized by 
certain definite aspects, such as the following: a strong tendency to 
build patterns following structural lines; an inability to account for 
a grouping or to grasp the meaning of sorting; dependence upon 
sensory aspects and an inability to shift voluntarily from one sensory 
impression to another; inability to generalize from one performance 
to another. 

Inability to approach the test at the abstract level is indicative of 
disturbance of cortical functioning. Here again, the examiner’s eval- 
uation of a subject’s performance is qualitative, there being no scor- 
ing norms. 


The Goldstein-Scheerer Stick Test. This is intended to examine the 
subject’s ability to: (1) copy relatively simple geometric figures com- 
posed of sticks which are 3.5 inches and 5.5 inches in length; and 
(2) reproduce these same figures from memory, after exposure of 5 
to 30 seconds. The sequence in which the thirty-four figures are 
presented represents something of a scale in terms of number of 
sticks involved and increasing complexity of the model. 

As in the other Goldstein tests, the subject is asked to explain his 
performance, in order that the examiner may evaluate it as being 
concrete or abstract. If the subject’s performance and report indicate 
that he has perceived and responded to “purely directional features 
in space”; if the subject perceives the figures as “mere configurations 
of spatial direction in a detached, purified sense”; if to the subject 
the figures “imply both space and direction as entities in them- 
selves,” then the performance is on the abstract level. For, in that 
event, presumably, the figures bear “no direct reference to a tangible 
life situation.” Apparently, at the abstract level, the subject is ex- 
pected to survey the pattern, and analyze it into its parts and part- 
relationships (space and direction). 

Reference to the stimulus-figure as being associated with or repre- 
sentative of an object in actual experience constitutes a concrete 
response. Thus, to reproduce the stimulus-figure F correctly and call 
it an “F,” or the figure A and call it a “roof,” is to perform at the 
concrete level. 

In this stick test, more than in the others of Goldstein, the examiner
is required to exercise his psychological insights in often making
a subtle distinction as between a concrete or an abstract response. 

The authors have found this test to be “particularly suited for 
cases with greater mental defectiveness or deterioration.” 


Evaluation of the Goldstein Tests. The authors of this series of tests 
do not present the type of statistical information upon which evalua- 
tions are usually based. In fact, there has been no attempt on their 
part to standardize their tests with respect to the usual criteria of 
validity and reliability. Their emphasis is placed entirely upon qualita- 
tive evaluation of the subject’s responses to the problem situations as 
an aid in the diagnosis of mental impairment or of mental arrest. There 
is no doubt, however, that the clinical value of the Goldstein tests 
would be enhanced if quantified or scaled ratings could be obtained, 
not necessarily to obtain percentile or standard scores, or the like, but 
to facilitate an overall estimate of an individual's responses and to 
increase the objectivity of the tests’ interpretation. At the present time, 
these devices appear to be valuable in the hands of an experienced 
psychologist for the purpose of detecting cases of marked intellectual 
impairment or arrest through the presentation of problems involving 
tasks of abstraction which are rather simple for persons of average 
ability who are functioning normally. In fact, some parts of this series 
of tests are at a difficulty level where most children of seven or eight 
years, and in other parts at nine years, can achieve a solution at the 
“abstract” level. Hence, in most instances, the failure of an adult to
offer a solution, at the abstract level, with not too much difficulty,
may well be regarded as a significant symptom of loss of intellectual
level and efficiency.


The Hanfmann-Kasanin Test. This test consists of twenty-two
blocks, each being in one of five colors, six shapes, two heights, and
two widths. The problem for the subject is to discern how the blocks 
may be divided into four categories; namely: tall-wide, flat-wide, 
tall-narrow, and flat-narrow figures. 

The subject is shown the entire set of blocks, randomly arranged. 
The examiner selects one as a sample and directs the examinee to 
pick out all others that are of the same kind. It is obvious that the 
subject might at first make his selection according to color, shape, or
size. Each type of block has a nonsense name on the bottom, concealed
from the subject (e.g., bik for flat-wide). After each grouping,
the examiner shows the subject one of the wrongly selected forms




by revealing that it carries a different name. The procedure is to con- 
tinue this kind of aid until the subject discovers the predetermined 
classification, if possible. 

Performance is analyzed and scored in respect to interpretation 
of the task, nature of the attempts at solution, and discovery of the 
correct solution. In each of these, three levels of performance are 
distinguished: the primitive, the intermediate, and the conceptual, 
these being scored 1, 2, and 3, respectively. The principle of the scor- 
ing method may be generalized by saying that it is based upon the 
nature of the subject’s approach to the problem, his ability to con- 
ceptualize, and his ability to verbalize his performance. 
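
The scoring plan just described can be pictured as a small record. The following Python sketch is illustrative only: the class and field names are hypothetical, and it deliberately computes no total, since the text does not say how (or whether) the three ratings are to be combined.

```python
from dataclasses import dataclass

# Level labels and scale values as stated in the text: 1, 2, and 3.
LEVELS = {"primitive": 1, "intermediate": 2, "conceptual": 3}


@dataclass
class HanfmannKasaninRecord:
    """One subject's ratings on the three scored aspects.

    Field names are illustrative assumptions, not the test authors' terms.
    """
    interpretation_of_task: str
    attempts_at_solution: str
    discovery_of_solution: str

    def scores(self) -> dict:
        """Map each aspect to its 1-3 scale value (KeyError on an unknown level)."""
        return {
            "interpretation_of_task": LEVELS[self.interpretation_of_task],
            "attempts_at_solution": LEVELS[self.attempts_at_solution],
            "discovery_of_solution": LEVELS[self.discovery_of_solution],
        }


# A made-up subject, for illustration only.
record = HanfmannKasaninRecord("conceptual", "intermediate", "primitive")
print(record.scores())
```

Keeping the three ratings separate, rather than summing them, reflects the fact that the values are arbitrary scale labels rather than experimentally determined weights.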



Evaluation. While in the development of this test its authors were 
concerned principally with schizophrenics and their clinical differentia- 
tion, Rapaport found it clinically useful in obtaining evidence of the 
subject’s modes of thinking and response to difficult and frustrating 
problem situations, rather than for diagnostic differentiation. For
example, different individuals may show the following modes of re- 
sponse in varying combinations and degrees: fluidity (lack of direc- 
tion); flexibility (varying the approach, but keeping the end in view); 
rigidity (resistance to modification of behavior); persistence (con- 
tinuity of behavior). These qualitative descriptions, therefore, ob- 
tained with the Hanfmann-Kasanin test are of the kind that are valu- 
able as a supplement to numerical ratings in obtaining a more nearly 
complete description of a person’s mental functioning. 

Although the authors themselves used a scoring plan, the weights 
of the assigned values have not been experimentally determined. They 
are, rather, arbitrarily assigned scale-values; they may, therefore, be 
more properly designated as numerical indicators or identifications.
As such they are useful in obtaining an over-all evaluation of an in-
dividual’s performance on the test. 

The problem-situations presented in the Hanfmann-Kasanin test 
are at a more difficult and higher level of abstraction than those in the 
Goldstein series. Since these problems make greater demands upon 
subjects of higher mental levels, they may reveal deterioration in
conceptual thinking that would still be unapparent in the subject’s
long-established responses for meeting familiar situations and for dealing
with familiar problems. The Hanfmann-Kasanin test, like the Goldstein
series, provides a means of observing behavior in a controlled
situation and of obtaining information of some significance to add to
other psychological data and information.

Rapaport, op. cit., Chapter 4.

FIG. 16.2. Hanfmann-Kasanin test blocks. (C. H. Stoelting Co.
By permission.)


THE HUNT-MINNESOTA TEST FOR
ORGANIC BRAIN DAMAGE


This test has been devised as an aid in clinical detection of
organic brain damage, in the case of individuals who are sixteen
years of age or older. The instrument consists of three major
divisions: the vocabulary test of the 1937 Stanford-Binet, which has
been found relatively insensitive to brain damage; six memory and
recall tests, which are considered to be sensitive to brain damage;
and nine interpolated tests which serve as “validity indicators.”

By H. F. Hunt. Published by University of Minnesota Press, 1943. See
P. E. Meehl and M. Jeffery, “The Hunt-Minnesota Test for Organic Brain
Damage in Cases of Functional Depression,” Journal of Applied Psychology,
Vol. 30, 1946, pp. 276-287; H. Juckem and J. A. Wold, “A Study of the
Hunt-Minnesota Test for Organic Brain Damage at the Upper Levels of
Vocabulary,” Journal of Consulting Psychology, Vol. 12, 1948, pp. 53-57.

The six deterioration tests involve memorizing and retention of 
paired designs (presented visually, of course), and paired words pre- 
sented orally. Both types are used to test immediate and delayed re- 
call. A series of paired designs is exposed, without interruption, for 
six seconds each, after which the subject is shown one of each pair, 
in sequence, and is required to identify the design associated with it 
(immediate recall). In the word test, a series of ten pairs of words 
is read; after this is done, the first word of each pair is given singly, 
and it is the subject’s task to name its paired word (immediate
recall). There are three tests of designs and three of words.

The vocabulary test, as in the case of other clinical instruments 
(e.g., the Babcock) is regarded as being relatively stable, as holding 
up against deterioration. The vocabulary score, which is the number 
of words correctly defined according to the Stanford-Binet standards, 
is taken as the base for the determination of deterioration score, from 
which presence or absence of brain damage is inferred. 

The interpolated tests consist of the following: information; nam- 
ing the months of the year; counting from 1 to 20; counting from 
3 to 30 by 3's; tapping on the table every time the number 3 is read 
in a long series of digits (attention test); counting backwards from 
25 to 1; repeating digits backwards; naming the months in reversed 
order; subtraction of 3’s from 79 to 1. These items are included as
“validity indicators,” since persons unable to perform these are too
uncooperative, or too disturbed, or too deteriorated to be tested;
hence testing them will not yield valid results. Critical scores are
given for each of the interpolated tests,
these having been reached or exceeded by ninety percent of the 
brain-damaged persons used in standardizing the test. The test’s
author reports that individuals whose scores fall below the critical
levels in three or more of the interpolated items cannot be validly
tested. The interpolations also provide the means of filling a time
interval after which retention of the same paired designs and paired
words (delayed recall), given at the beginning of the examination, is
tested.

16 Subsequent to publication of the test, Hunt reported that its maximal
validity is for persons between the ages of 20 and 55 years, whose
vocabulary scores are not at either one of the extremes of the
distribution (less than 12 words nor more than 32). See H. F. Hunt, “A
Note on the Clinical Use of the Hunt-Minnesota Test for Organic Brain
Damage,” Journal of Applied Psychology, Vol. 28, 1944, pp. 175-178.
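
The validity rule stated above (scores below the critical level on three or more interpolated tests mean that the subject cannot be validly tested) lends itself to a brief sketch. In the Python below, the function names, the test labels, and every critical score are hypothetical placeholders; the actual critical scores appear in the test manual and are not reproduced here.

```python
def failed_validity_indicators(scores, critical_scores):
    """Return the interpolated tests on which the score falls below the critical score."""
    return [name for name, critical in critical_scores.items()
            if scores[name] < critical]


def can_be_validly_tested(scores, critical_scores):
    """Rule from the text: three or more failures means the subject cannot be validly tested."""
    return len(failed_validity_indicators(scores, critical_scores)) < 3


# Placeholder critical scores for the nine interpolated tests (NOT the published values).
critical = {
    "information": 5,
    "months_forward": 1,
    "count_1_to_20": 1,
    "count_by_3s": 1,
    "attention_tapping": 4,
    "count_backward_25_to_1": 1,
    "digits_backward": 3,
    "months_reversed": 1,
    "subtract_3s_from_79": 1,
}

# A made-up subject who meets every critical score except two.
subject = dict(critical)
subject.update({"information": 2, "digits_backward": 1})

print(failed_validity_indicators(subject, critical))
print(can_be_validly_tested(subject, critical))
```

A score that exactly equals the critical score is counted as a pass, since the text describes the critical scores as "reached or exceeded."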


Evaluation. This test was standardized upon a small number of pa- 
tients (only 33) in several state hospitals who had been diagnosed as 
suffering from organic cerebral damage, excluding congenital condi- 
tions, birth injuries, and childhood brain injury. In age they ranged 
from 16 to 70 years. The control group consisted of 41 cases in state 
institutions, neuropsychiatric wards, or war veterans’ hospitals, but 
who were not diagnosed as cases of organic brain damage. 

When twenty-five cases from each group (brain-damaged and 
control) were equated on the basis of only their vocabulary scores, 
striking differences were found on the deterioration tests, and much 
smaller differences on the interpolated items. On the deterioration 
tests, using raw scores, the critical ratios of the differences between
the two groups were 6.8 or more, whereas on the interpolated tests
none of the critical ratios was greater than 3.8. The total overlapping
of deterioration scores of the two groups was 50 percent, while total 
overlapping of interpolated scores was 90 percent. These data indi- 
cate that the former tests are much more discriminative between the 
two groups than the latter, and presumably, therefore, much more 
sensitive to effects of deterioration. The fact, however, that there is a 
50 percent total overlap indicates that the test results must be viewed 
not in isolation, but in conjunction with other evidence in each case.
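
For readers unfamiliar with the critical ratio, it is conventionally the difference between two group means divided by the standard error of that difference. The Python sketch below assumes two independent groups; the text does not state which formula was used, and groups equated on vocabulary might call for a formula that allows for correlated measures. The function name and the scores are invented for illustration.

```python
import math
import statistics


def critical_ratio(group_a, group_b):
    """Difference between two means divided by the standard error of the difference.

    Assumes independent groups; each standard error is the sample standard
    deviation divided by the square root of the group size.
    """
    se_a = statistics.stdev(group_a) / math.sqrt(len(group_a))
    se_b = statistics.stdev(group_b) / math.sqrt(len(group_b))
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se_diff


# Invented scores, not data from the standardization groups.
group_one = [10, 12, 14, 16, 18]
group_two = [4, 6, 8, 10, 12]
print(round(critical_ratio(group_one, group_two), 2))
```

A large critical ratio shows only that the group means differ reliably; as the 50 percent overlap reported above indicates, it does not guarantee that individual cases can be classified without error.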

The correlation between deterioration scores and vocabulary scores
was found to be -.51; between age and vocabulary, .07; between age
and deterioration score, -.37. Multiple correlation of deterioration
score with age and vocabulary was -.65.


This instrument, like all others in this difficult area of psychological
testing, needs to be more clearly validated by further application
and experimentation; for it appears that in some groups it does not
differentiate as well as it should, too large a percentage of normal
subjects having got scores which would suggest brain damage. It is
well, therefore, to view this test at present, as suggested by its
author, as additional evidence offered in support of or contrary to a
clinical suspicion of organic brain damage. At the same time, the more
extreme positive deterioration scores have more diagnostic value.

R. F. Malamud, “Validity of the Hunt-Minnesota Test for Organic Brain
Damage,” Journal of Applied Psychology, Vol. 30, 1946, pp. 271-275.


THE BENDER VISUAL-MOTOR GESTALT TEST


This test consists of nine figures which are characterized chiefly 
by their patterning, or configuration (i.e., their gestalt). The subject 
is simply instructed to copy each figure, without time limits, while
it is before him. The test, clearly, is not one of visual memory or
imagery; it is, rather, one of perception and of visual-motor
functioning.

The figures used were devised by Max Wertheimer, one of the
founders of the Gestalt school of psychology, in his experimental
work on perception. The underlying principle utilized in this
Bender test, as expounded by Wertheimer and others, is that
organized wholes (structured units) are the primary forms of
perception in human beings. Disturbances of perception (loss of
integrative perception), therefore, might be psychopathological
manifestations. Perceptual behavior is regarded, in the test, as
involving sensory reception of the figures, interpretation at the
central levels of the nervous system, and motor performance (drawing).
This total process of perception and reproduction can be distorted by
neural injury, by emotional maladjustment in the perceiving
individual, and by variations in the level of intellectual performance.
Hence Bender explored the possibilities of this test of perception by
investigating the “gestalt functions” in cases of aphasia, organic brain
disease, schizophrenia, manic depressive psychoses, mental defectives,
malingerers, psychoneurotics, and normal children.

FIG. 16.3. The Bender Visual-Motor Gestalt Test. A Visual-Motor
Gestalt Test and Its Clinical Use. Research Monograph No. 3, American
Orthopsychiatric Association, 1938.

L. Bender, A Visual-Motor Gestalt Test and Its Clinical Use, Research
Monograph No. 3, American Orthopsychiatric Association, 1938; F. Y.
Billingslea, The Bender-Gestalt: An Objective Scoring Method and
Validating Data, Clinical Psychology Monographs, No. 1, 1948.

Uniform directions for administering the test and scoring or
analyzing the results have not yet been evolved. M. L. Hutt, who used
this test extensively in the armed forces, has suggested that the
drawings be evaluated for the following principal aspects:


Arrangement of the drawings: order (methodical, logical, irregular,
confused); cohesion (expansive or compressive style); use of
margin as a guide.

Modifications in size: reduction or expansion.

Use of white space: space left blank between drawings.

Modifications of the Gestalt: elaboration, distortion, destruction.

Motor incoordination: poor motor control representing physiological
tensions or poor muscle tonicity.


Bender herself offers a different evaluation scheme, the essentials of
which are:


Movements used in making drawings: e.g., speed, rhythm,
perseveration.

Form of the drawing: e.g., outline, arrangement, spatial orientation,
form differentiation, size, omissions.


In either case, the examiner also observes the behavior of the subject
during the testing process, noting the methods used, verbalizations,
and attitudes.


Subsequent to both of the foregoing analytic schemes, Pascal and
Suttell provided a standardized and quantitative system of scoring the
reproductions of adults. Essentially the scoring procedure was arrived
at thus: (1) the reproductions of psychiatric patients were compared
to those of normal persons; (2) the drawings of abnormal persons
tended to deviate from the originals more than did those of
normal subjects; (3) deviations (differences between originals and
reproductions) that discriminated between normal and psychiatric
subjects were isolated; (4) a deviation was retained if item analysis
showed that it discriminated significantly between the two groups, or if
it occurred in the reproductions of abnormals but “practically never”
in those of normals; (5) deviations were weighted according to
discriminative value between the two groups; (6) “score reliability” was
determined (r = .90; N = 120); (7) norms were found for a non-patient
population of 474 subjects (271 males; 203 females) varying
in age from 15 to 50 years, most of whom were attending evening
classes at the high-school or college level.

G. R. Pascal and B. J. Suttell, The Bender-Gestalt Test, New York:
Grune and Stratton, 1951.

After the scoring method had been evolved and the norms determined,
validity was studied by: (1) “blind” matching of scores with
group classification (normal, neurotic, psychotic); and (2) prediction
of improvability of patients receiving therapy, tested on admission.
The test was fairly effective in distinguishing between normals and
psychotics and between normals and neurotics; but it discriminated
only slightly between psychotics and neurotics. As between patients
who were described, upon discharge, as “improved” or “unimproved,”
under (2), the mean scores for the two groups showed significant
differences. The results under both (1) and (2) were sufficiently
discriminating to encourage the use of the Bender-Gestalt as one possible
source of significant evidence in the diagnosis of normal and
abnormal personalities.


TABLE 47
Bender-Gestalt Test Means and Ranges of Scores

              Mean Scores *    Middle 60%    Range
Normals            50            47-60       32-79
Neurotics          68            53-80       32-139
Psychotics         81            65-100      40-155

* Values are in terms of Z scores; mean equals 50, standard
deviation equals 10. The higher scores are the more unfavorable.
(Based upon data in Pascal and Suttell, op. cit., pp. 30-31.)
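
The note to Table 47 states that the scores are expressed on a standard scale having a mean of 50 and a standard deviation of 10. A minimal sketch of such a conversion is given below in Python. The function name, the raw score, and the normative mean and standard deviation in the usage line are hypothetical illustrations; they are not taken from Pascal and Suttell.

```python
def standard_score(raw, norm_mean, norm_sd, scale_mean=50.0, scale_sd=10.0):
    """Place a raw score on a standard scale (default: mean 50, SD 10).

    norm_mean and norm_sd describe the normative group; the values
    passed in the example below are invented for illustration only.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd      # distance from the norm mean in SD units
    return scale_mean + scale_sd * z     # rescale to the reporting scale


# Hypothetical case: a raw score one standard deviation above the norm mean.
example = standard_score(raw=30, norm_mean=20, norm_sd=10)
```

Since higher scores are the more unfavorable on this test, a result above 50 marks a reproduction that deviates from the originals more than the average of the normative group.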

As in all such situations, it must be noted that while the mean scores 
and standard deviations for the groups are significantly different, there 
is still significant overlapping of scores of the three groups (normals, 
neurotics, psychotics). Overlapping of scores within different groups, 
classified for various purposes, is the usual psychological phenomenon. 
In dealing with an individual, therefore, in respect to a particular trait 
or function, it is always necessary to consider the possibility that he 
might deviate from the central tendency of his group. Specifically, the
scores on the Bender-Gestalt will illustrate this point, as shown in
Table 47.


EVALUATION OF TESTS OF IMPAIRMENT 


Psychological tests in this area are based upon the principle 
that old, well-established habits and modes of behaving (such as word 
knowledge) show relatively little loss, whereas new learning, newly 
acquired associations, performing new tasks and solving new types of 
problems are impaired in cases of brain damage and other forms of 
mental disturbance. 

As a group, these psychological tests have not been adequately 
standardized, so far as norms, scoring, and interpretation are con- 
cerned. It must be recognized, however, that standardizing tests for 
the extremely deviant and often uncooperative populations for whom 
these tests are intended is an extraordinarily difficult task. Since these 
tests are intended for adults primarily, their standardization and in- 
terpretation are further complicated by the fact that allowance should 
be made for normally expected loss at advanced ages; and individual 
differences in cultural, educational, and occupational backgrounds 
must be considered in evaluating a subject’s performance. Validation 
is made the more difficult, too, since one generally used criterion of 
validity is psychiatric diagnoses and classifications; and these are 
themselves sufficiently inconsistent and lacking in reliability as to in- 
troduce an important source of error in the validation process. 

These tests do not, as yet, provide a self-sufficient method for
measuring mental deterioration, except in the more marked cases.
However, clinicians who have used them are in substantial agreement
that they have value in that they provide opportunities for observation
of mental operations under controlled conditions, in which prescribed 
materials are used. In such situations, an experienced psychologist is 
able to make important qualitative observations, in addition to de- 
riving, at times, quantitative values for the subject’s performance. The 
particular qualitative observations that can be made will depend upon 
the content and technique of the test. The observations on schizo- 
phrenics quoted on page 434 provide relevant illustrations. Other 
qualitative observations might include descriptions of the subject’s 
thought processes, estimates of levels of abstraction or concreteness 
(as in the Goldstein and Hanfmann Series), bizarre responses, degree 
of self-criticism, fluctuations of attention, degrees of rigidity or
flexibility of thought processes, level of immediate and delayed recall
compared with recall of remotely learned materials, and evaluations of the
subject’s performance in the light of his former educational and oc- 
cupational status. To make these observations, of course, requires a 
background of experience with a sufficiently large and varied number 
of subjects, including persons within the normal range of adjustment 
and performance. 


17.


PERSONALITY RATING SCALES


DEFINITION OF PERSONALITY 


This term has been variously defined because personalities 
are complex and inclusive of all traits; hence, there is much room for 
differences in comprehensiveness of the definition. Those definitions, 
however, which include only the social value of an individual to 
other members of his group (that is, more or less superficial attrac- 
tiveness or reputation) must be rejected as inadequate because they 
are concerned only with overt behavior, while they ignore the inner 
aspects of the personality: the perceptions, feelings, reactions, at- 
titudes, values, prejudices which are the basis of one’s behavior. 
These social-value definitions, in other words, are concerned only 
with what a person does and the impression made by him upon 
others in his social groups. Such impressions and evaluations are, of 
course, important in an individual’s life; and they are evaluated by 
means of rating scales, which will be presented in this chapter. These 
definitions do not, however, take into account what are the basic 
traits of a person, aside from what he actually does, which might or 
might not reveal the covert aspects of his personality. As a matter 
of fact, some parts of the personality inventories, as distinguished 
from rating scales, are intended to identify these covert traits in order 
to provide the psychologist with a basis for a fuller understanding 
of an individual's behavior. Whether they succeed in doing so is a 
question to be discussed. Also, projective methods, more than any 
other type of instrument, are intended to reveal the covert, subtler
aspects of personality and behavior. These methods will be dealt
with in the following chapters. 

A definition which commends itself is the following: a personality 
is the product of the dynamic and unique organization within the 
individual of psychobiological structures, or systems, and their
interaction with the environment. It is these two aspects, the uniqueness
of the structured organism and the characteristics of his environment,
that determine the individual’s particular adjustments to his
surroundings.¹ A personality is the individuality that emerges from
interaction between a psychobiological organism and the world in which
he has developed and lives. 

Personality is described in terms of an individual's behavior—his 
actions, postures, words, and attitudes and opinions regarding his
external world. But personality is described, also, in terms of the 
individual’s covert feelings about his external world; feelings which 
may not be apparent or discernible in his overt behavior. Further- 
more, it is described in terms of one’s feelings about himself. 

One’s actual feelings about his external world and about himself 
may be at the conscious, preconscious, or unconscious level. The 
same is true regarding the consciousness levels of the reasons for 
these same feelings. In other words, a person may know why he 
feels as he does (conscious); or the reasons for his feelings may be 
somewhat below (figuratively) the level of awareness, but such as 
to come to awareness with relatively little effort under appropriate 
stimulation (preconscious). Or the factors may be so deeply
submerged or blurred that they can be brought to the level of
awareness only with difficulty, if at all (unconscious). This being the case,
then, it is desirable to have tests of personality which can probe the 
various aspects of personality and the levels of consciousness as well. 

Several aspects of the foregoing definition need a little more ex- 
planation before the instruments themselves are presented. 

¹ For differing emphases in defining personality, see G. W. Allport,
Personality, A Psychological Interpretation, New York: Henry Holt, 1937,
p. 48; and Gardner Murphy, Personality, A Biosocial Approach to Origins and
Structure, New York: Harper, 1947, Chapter 1. These two books are, to the
present writer, the outstanding treatises of recent years on the subject.

By dynamic organization psychologists mean that personality traits
do not exist independently or act in isolation. They are interrelated,
interacting in an organized and coherent manner. They may, like any
other organized system, be in process of change and evolution. Dis- 
organization of traits and behavior results in “abnormality.” It is in 
respect to the dynamic organization of traits that one of the great 
difficulties of personality testing is encountered, for it is much sim- 
pler to attempt the construction of an inventory that will give an 
indication of an individual’s tendencies to introversion, or sociabil- 
ity, or self-confidence, or ascendency, etc., than it is to measure or 
test the person as a whole. The reader will note, later, that actually 
the available inventories and rating scales deal only with smaller or 
larger segments of the personality, a number of which may be por- 
trayed on a psychological profile; but to be meaningful these must 
somehow be organized into a meaningful whole by the psychologist 
who studies the individual case. 

The term psychobiological structures connotes motives, habits, 
traits, attitudes, feelings, values, ways of thinking and acting. The 
word “psychobiological” is used to indicate that personality and its 
component integrals are neither exclusively mental nor exclusively 
neurological. Rather, they involve psychological processes and
functioning (mind) together with their biological correlates (body).

Interaction with the environment is made explicit in order to 
emphasize that an individual's personality does not merely grow from 
within. It is the product of the interaction between himself as a de- 
veloping organism having certain psychological and biological needs, 
on the one hand, and, on the other, his environment which has
nurtured, influenced, directed, satisfied, or in varying degrees failed to
satisfy those needs.


RATING SCALES: MAJOR ASPECTS 


This type of device is useful chiefly to learn what impression
an individual has made, in respect to some specified traits or atti- 
tudes, upon persons with whom he has come in contact. To a lesser 
extent it is also used for self-ratings. It is a device with which “social 
value” is rated in certain specified areas; it reflects the impression the 
subject has made upon the persons who do the rating. The rating of 
one person by others is among the oldest of practices, the present 
psychological tools being refinements upon the common practice of 
providing letters and oral recommendations. 


For the evaluation of an individual, rating scales are submitted to
teachers, counselors, employers, colleagues, parents, and others who
have had sufficient contact with the person in question to have 
formed an opinion based upon evidence. Usually, of course, ratings 
of a particular person are obtained from more than one judge; for 
validity of ratings is thereby increased, inasmuch as subjectivity of 
judgment is decreased through the balancing of errors and bias. 

Rating scales may be devised for a great variety of personality 
traits: tact, generosity, leadership, cooperativeness, resourcefulness, 
punctuality, industriousness, honesty, emotional control, study hab- 
its, personal attractiveness, and many others, the number of possibili- 
ties being virtually unlimited. Each scale usually includes traits to 
be rated individually, the specific ones depending upon the purposes 
for which the scale is intended. The terms used to designate each 
of the several traits being rated are often vague in themselves and 
may have different meanings for different judges. In order to mini- 
mize this problem of semantics and to make the rating scales more 
useful than they would otherwise be, it is necessary to observe certain 
established principles. The following, then, are the major aspects to 
be considered in their construction and use.²

The traits must be clearly defined. This is essential so that they 
may be clearly and uniformly understood by all judges. This end may 
be achieved by giving explanations, synonyms, or specific instances as 
behavioral illustrations. 

The degrees of the trait must be clearly defined. Each trait is rated 
on a scale generally having four or more intervals, five and seven be- 
ing the most frequent. A larger number of intervals requires refine- 
ments of distinctions and ratings which are not often possible. Each 
step on the scale of each trait must be clarified in much the same 
way as the trait definitions themselves. 

The scale may be of the scoring or ranking type. These are the 
two basic types, although there are variations. In the first type, the 
subject is rated at a point, or level, on the scale without direct refer- 
ence to or comparison with other persons in his group (his classroom, 
his fellow workers, his fellow club members, and the like). Each 
point, or level, on the scale carries a specified score. On a scale of 
five, for example, the “average” person may be scored 0; in which 
case the deviants would be scored -1, -2, +1, or +2. Or the scores
might all be positive, the lowest rating being 1, the average 3, and the
highest 5.

² Cf. Allport, op. cit., Chapter 16.

A common variant of the scoring scale is the graphic rating scale. 
The several levels, or degrees, of the trait are defined and placed at 
points along a horizontal line. The judge places a mark anywhere he 
chooses on this line, between the two extremes. Although a graphic 
scale permits, theoretically, scoring at a very large number of points, 
such refinement and spurious accuracy are not warranted. The in- 
vestigator or compiler of information will, therefore, convert each 
rating within a given range according to a predetermined numerical 
scheme. 
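
The conversion of a mark on a graphic scale into a score within a predetermined numerical scheme can be sketched as follows in Python. The function name, the line length of 100 units, and the division of the line into five equal ranges are assumptions chosen for illustration; an investigator may adopt any other predetermined scheme.

```python
def graphic_to_score(position, line_length=100.0, levels=5):
    """Convert the position of a judge's mark along the line into a score.

    The line is divided into `levels` equal ranges (an assumed scheme);
    a mark falling in the first range scores 1, in the last range `levels`.
    """
    if not 0 <= position <= line_length:
        raise ValueError("the mark must lie on the line")
    width = line_length / levels
    # A mark exactly at the right-hand end belongs to the last range.
    index = min(int(position // width), levels - 1)
    return index + 1


# Hypothetical mark placed at the midpoint of the line.
midpoint_score = graphic_to_score(50.0)
```

All marks falling within the same range receive the same score, so that the apparent precision of the graphic mark is not carried into the record.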


The following are examples of the numerical rating device. 


How emotional is the parent’s behavior where the child is con- 
cerned? 


Constantly gives vent to unbridled emotion in response to 
child’s behavior. 


Controlled largely by emotion rather than by reason in 
dealing with child. 


Emotion freely expressed, but actual practice is seldom
disorganized. 


Usually maintains calm, objective behavior toward child, 
even in face of trying situations. 


Never shows any sign of disorganization toward child. 


Does he get others to do what he wants done?


____ Displays marked ability to lead his fellows
____ Sometimes leads in important affairs
____ Sometimes leads in minor affairs
____ Lets others take the lead
____ Probably unable to lead his fellows
____ No opportunity to observe


Each rater checks what he believes to be the correct description. The
investigator then will convert the check marks into scores. In this
instance, the third item would be the average and would be scored
zero; the first and second would be -2 and -1, respectively; the
fourth and fifth, +1 and +2, respectively.
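
The conversion of check marks into scores described above can be sketched as follows in Python. The function name and the use of None for an unscored item (such as “No opportunity to observe”) are assumptions made for illustration.

```python
def score_check_mark(position, levels=5):
    """Convert the 1-based position of the checked description into a score.

    With five levels the third description is the average and scores zero;
    the first and second score -2 and -1, the fourth and fifth +1 and +2.
    """
    if position is None:
        return None  # nothing was rated, so there is nothing to score
    if levels % 2 == 0:
        raise ValueError("an odd number of levels is needed for a middle item")
    if not 1 <= position <= levels:
        raise ValueError("position must lie between 1 and the number of levels")
    middle = (levels + 1) // 2
    return position - middle


# Scores for the five descriptions, taken in their printed order.
five_point_scores = [score_check_mark(p) for p in range(1, 6)]
```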


The graphic type of item may be illustrated by the following,
remembering the necessity of definition of the trait and clarification
of levels.


Quality of work

|--------------|--------------|--------------|--------------|
Of doubtful    Not quite up   Satisfactory.  Superior to    Exceptionally
satisfaction.  to standard.                  general run.   high.


Attitude toward others

|-----------------|-----------------|-----------------|-----------------|
Quarrelsome,      At times          Ordinarily        Always congenial  Unusually strong
uncooperative,    difficult to      tactful,          and cooperative.  factor in
upsets morale.    work with.        cooperative, and                    cooperation and
                                    self-controlled.                    group morale.

The ranking scale is used with persons who are associated within 
a single group and who are to be rated relative to one another. The 
judge arranges the names in serial order with regard to each one’s 
status in a specified trait. Usually, the judge is instructed first to 
select the individuals to be ranked highest, lowest, and average, and 
then to place the others in relation to these three. Since intervals be- 
tween successive individuals are not equal, and since it is impossible 
by this method to determine the sizes of the intervals, arithmetical 
and statistical computations are not warranted and should not be 
attempted. The ranking scale simply provides a method for use with 
a single group of subjects when, for any valid reason, intragroup com- 
parisons are desired. 

Reliability depends upon extent of variation of judges’ ratings. 
Judges rating an individual on a specified trait will not always agree 
as to his score or rank. It is customary, therefore, to take the mean 
of all the judgments as representing the nearest approximation to 
the true rating. If this method of averaging is to be meaningful, how- 
ever, it is essential that the variation of the judges’ ratings shall be 
small, thus indicating reasonably close agreement. A large variation, 
on the other hand, would indicate unreliability due to lack of clarity 
regarding the trait being evaluated, contradictory or unstable
behavior of the subject, or undependability of some judges. An
average of the judges’ ratings, without regard to their variation, might be
misleading or even absurd. For example, if, on a seven-point scale,
two judges rated an individual at -3, two at +3, and two at zero,
the mean rating would be zero (or average level), whereas the
probability is that he is not average at all in view of the wide disparity of
judgments. Reliability of ratings is usually dependent upon having 
a sufficient number of qualified judges, five to seven being the num- 
ber frequently recommended. The degree of agreement required be- 
fore a set of judgments may be regarded as sufficiently reliable is, 
to some extent, arbitrarily determined; but to be statistically reliable, 
agreement among judges should be three or four times as great, at 
least, as that obtained by chance. 
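
The example of the six judges can be verified with a short computation. The sketch below, in Python, reports the mean of a set of ratings together with their standard deviation, so that an average is never read without regard to the variation behind it. The function name and the limit of 1.0 used to flag wide disagreement are assumptions for illustration; the text prescribes no particular limit.

```python
from statistics import mean, pstdev


def summarize_ratings(ratings, max_spread=1.0):
    """Return (mean, standard deviation, agreement flag) for judges' ratings.

    max_spread is an assumed, arbitrary limit on the standard deviation
    beyond which the mean should not be taken as a meaningful rating.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    average = mean(ratings)
    spread = pstdev(ratings)  # standard deviation of this set of judgments
    return average, spread, spread <= max_spread


# The case from the text: two judges at -3, two at +3, and two at zero.
disputed = summarize_ratings([-3, -3, 3, 3, 0, 0])

# A hypothetical set of judges in close agreement.
agreed = summarize_ratings([1, 1, 2, 1, 1])
```

In the first case the mean is zero but the standard deviation is about 2.4 scale points, which shows at once that the “average” rating conceals a wide disparity of judgments.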

At times it is possible to find out, by inspection of all the ratings,
which judges seem to be most dependable by noting the extent to
which each of their ratings approximates the mean [...]
trait. Similarly by inspection, it is [...]

Methods of studying reliability of rating scales include the
following, most commonly: repeating judgments after a time interval;
correlation between ratings of two or more judges; and relationship
between judges’ ratings and self-ratings.³ The correlation coefficients
thus found are in the neighborhood of .50 and .60, much lower than
would be acceptable in the case of tests of general intelligence, specific
aptitudes, or educational achievement. Occasionally, however, much
higher reliability coefficients have been obtained, some being in the
neighborhood of .85 to .90.

³ See J. P. Guilford, Psychometric Methods, New York: McGraw-Hill,
1936, Chapter 9.

The relatively low reliabilities are not necessarily attributable to
defects in the conception of a scale nor in the phrasing of the
characteristics to be rated. The coefficients reflect, to a considerable
extent, [...] between and the unreliabilities [...]

Determination of validity [...] criteria of validity, [...]
scores, etc. In fact, [...] validating criteria with
other types of tests (e.g., intelligence, personality inventories) are the
measures of reliability of rating scales, as indicated above. In other 
words, the validity of the rating scale is assumed, in actual practice, 
to rest upon the judges’ understanding of the meanings of the traits 
being evaluated and their accuracy in rating them. The principal indi- 
cation of validity of some rating scales is the fact that persons using 
them—guidance counselors, personnel officers, employers—find them 
helpful if the judges are carefully selected and if the ratings are con- 
scientiously made. 

Overt traits are more reliably rated than covert. Traits which can 
be rated upon objective activities, on the basis of actual past or present 
performances known to the judges, are most reliable. For example, 
general emotionality, social acceptability, aggression, fear, anxiety, 
and impulsion are rated with greater reliability than those dealing with 
a person’s inner life and feelings about the self.⁴

Judges should be instructed. It has already been stated that the 
traits to be rated and the intervals on the scale should be clearly de- 
fined for guidance of the judges. In addition, it is necessary to in- 
struct them with regard to other aspects affecting the reliability of 
their judgments: namely, each trait rating should be made inde- 
pendently of other ratings of the same person (avoidance of the 
“halo” effect); ratings should not rest upon inadequate acquaintance
with the subject; experience and acquaintance with a broad enough
variety of people to provide bases of judgment are desirable; sincere
motivation to provide the most reliable ratings possible is necessary.
Of the foregoing, the “halo” effect has been shown to be among the
most serious causes of unreliability. One tends, for example, to
overrate a person in all respects if he likes him, or if he is a close
acquaintance.

Judges should state their degree of certainty. With each rating, the
judge should be asked to state his degree of certainty (e.g., very
strong, strong, moderate).⁵ It has been demonstrated that judges are
most confident and most in agreement on ratings at the extremes. This
is understandable because extreme deviants are most distinguishable
from others and are most readily characterized by the trait names.
Thus, the terms cooperative-uncooperative and introverted-extroverted
apply most forcibly and clearly to those individuals who are
manifestly one or the other.

⁴ See R. Wolf and H. A. Murray, “An Experiment in Judging Personalities,”
Journal of Psychology, Vol. 3, 1937, pp. 345-365.
⁵ A similar technique is now often employed in polling public opinion.

Some persons are more accurately rated than others. On the 
whole, extroverted individuals are more reliably judged than intro- 
verted. Quite understandably, the ratings of persons whose traits are 
characterized by overt behavior rather than by covertness and inner 
qualities will be based upon fuller, more representative, and better 
understood behavior samples. Also, it has been found that judges rate 
more reliably those persons who most resemble themselves because, 
it is believed, one can best empathize with people whose behaviors 
resemble his own.⁶

Reliability of trait estimates is affected by desirability or unde- 
sirability of the trait. In self-ratings there is a tendency for indi- 
viduals to overrate themselves in respect to traits regarded as socially 
desirable. In rating other persons, especially friends, some judges may 
be similarly influenced even while trying to be conscientious. In fact,
there is a general tendency toward generosity in ratings, rather than
the reverse.


REPRESENTATIVE RATING SCALES 


A few scales, representative of the more satisfactory ones, will 
be briefly described. 


Haggerty-Olson-Wickman Rating Schedules.⁷ These are designed
for the detection and study of behavior problems and problem
tendencies in individuals from nursery-school through high-school levels.
Schedule A is a behavior-problem record, enumerating fifteen types
or sources of problems, such as speech difficulties and defiance of
discipline. Each of the fifteen is rated from 1 to 4, depending upon
frequency of occurrence. Schedule B is a graphic scale, consisting of
thirty-five traits classified into four groups: intellectual, physical,
emotional, and social. These traits are scored on a 5-point scale. The
authors provide better than usual evidence of validity. They report a
correlation of .76 with frequency of referral for reasons of discipline
or other action by school principals, which may be symptomatic of a
variety of adjustment difficulties. They report, also, that only ten
percent of “normal” children reach or exceed the median score of those
referred to the psychological clinic.

⁶ R. Wolf and H. A. Murray, op. cit.
⁷ World Book Co., 1930.


The Vineland Social Maturity Scale.⁸ This scale is unique in having
been constructed and standardized on the model of the Stanford-Binet 
scale. It is designed for use with individuals from infancy to the age 
of thirty years. 

Unlike many other scales, this one is based upon a well-defined 
rationale and has been systematically constructed. Behavior items are 
grouped at age levels, as in the Stanford-Binet. The items represent 
progressive maturation and adjustment to the environment in the 
following categories: self-help, self-direction, locomotion,
occupation, communication, and socialization. The following examples
illustrate the several categories.


Self-help: Reaches for nearby objects (age 0-1) 
Self-direction: Buys own clothing (age 15-18) 
Locomotion: Walks about room unattended (age 1-2) 
Occupation: Helps at little household tasks (age 3-4) 
Systematizes own work (age 25+) 
Communication: Makes telephone calls (age 10-11) 
Socialization: Demands personal attention (age 0-1) 
Advances general welfare (age 25+) 


Items are scored after interviewing someone well acquainted with 
the subject, or the subject himself. A social age is then obtained; this 
is divided by chronological age, yielding a social quotient (S.Q.). 
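
The computation of the social quotient can be sketched as follows in Python. The text states only that social age is divided by chronological age; the multiplication by 100, by analogy with the intelligence quotient, is an assumption here, as are the function name and the ages used in the example.

```python
def social_quotient(social_age, chronological_age, multiplier=100.0):
    """Social age divided by chronological age, both in the same units.

    The multiplier of 100 is an assumption made by analogy with the
    intelligence quotient; pass multiplier=1.0 for the bare ratio.
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return multiplier * social_age / chronological_age


# Hypothetical subject: social age of 8 years at a chronological age of 10.
sq = social_quotient(social_age=8, chronological_age=10)
```

A quotient below 100 on this convention would indicate social maturity below that expected at the subject’s chronological age.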

Although this social maturity scale shows a high correlation with 
intelligence test results (about .80), Doll maintains that it is distinct 
enough in content and in the functions rated to warrant its use in the 
study of an individual's general behavioral development, since social 
age provides a basis upon which to proceed in the care and training 
of an individual. 

While the scale is intended for use with a normal population as 
well as with mentally deficient, it was first conceived as an aid in the 
diagnosis of feeblemindedness. In the first instance, it was and still is 
intended to differentiate between mentally deficient individuals who 
are also socially inadequate, on the one hand, and, on the other hand,
the mentally retarded who are able competently to conduct their
personal and social lives.

⁸ By E. A. Doll. The Training School, Vineland, N. J., 1936. Also by Doll,
“Preliminary Standardization of the Vineland Social Maturity Scale,”
American Journal of Orthopsychiatry, Vol. 6, 1936, pp. 283-293.


The Vineland scale has had wide use in clinics for children and
adolescents; because, in addition to the uses already indicated, it is a
valuable device for interviewing and counseling both parents and
children.


Progressive Education Association Behavior Description.⁹ This is
intended for use by teachers. The traits included are the following:
responsibility-dependability, creativeness and imagination, influence,
inquiring mind, open-mindedness, power and habit of analysis, social
concern, emotional responsiveness, seriousness of purpose, social
adaptability, work habits, physical energy, assurance, self-reliance, and
emotional control. This is a rather inclusive and varied list of traits
which could be rated only by a judge who had had long and varied
contact with the subject. It will be noted that the traits include
complex intellectual activities as well as those more often found in
personality rating scales.


The Fels Parent Behavior Scales.¹⁰ This device provides thirty rating
scales, in as many aspects, for assessing parental behavior toward
their children. Ratings are made by a qualified observer in respect to
those aspects of the home environment that the rater has been able to
observe and review in a systematic manner, by means of home visits
and interviews with parents. Among the rated thirty aspects of
child-parent relationships are the following: discord in home, sociability of
family, child-centeredness of family, restrictiveness of regulations,
readiness of criticism, rapport with child.

In clinical and research work, these scales have been found useful
in providing a systematized approach to the detailed analysis of
psychological environments in which children are developing. The scales
provide a standardized method for describing and estimating the
nature of the effect of a child’s home upon his behavioral development.


⁹ Reports and Records Committee of the Progressive Education
Association, 1938.

¹⁰ Fels Institute, Yellow Springs, Ohio, 1937-1949.

Other Scales. The reader will have observed that the instruments
listed above are all intended for school and clinical use. The reason is
that in schools and clinics are to be found the better devices of this
kind. There are, of course, innumerable rating scales that are being
used in business and industrial personnel departments, social agencies,
and counseling organizations;¹¹ but they are not different from those
presented above, nor are they original in conception; and relatively
few of them have been subjected to the scrutiny to which scales devised
by psychologists in schools and clinics have been subjected. The items
that follow are from a pre-World War II rating scale prepared for and
used in a specific industry; but the items in this scale are typical of
those widely used in similar situations.


Relations with other supervisors: 
a. Often not satisfactory 
b. Sometimes not satisfactory 
c. Usually gets along well 
d. More satisfactory than average 
e. Exceptionally satisfactory 
Knowledge of the characteristics and abilities of subordinates: 
a. Knowledge markedly limited
b. Knowledge somewhat limited 
c. Knows employees fairly well 
d. Knowledge better than average 
e. Knowledge exceptional 
Willingness to make difficult decisions: 
a. Often “passes the buck” 
b. Inclined to “pass the buck” 
c. Usually properly willing 
d. More willing than average 
e. Exceptionally willing 
Resourcefulness in meeting difficulties: 
a. Often goes to pieces 
b. Easily discouraged by obstacles 
c. Usually meets the situation 
d. More resourceful than average 
e. Nearly always finds good way out 
Ability to learn new work: 
a. Learns with difficulty 
b. Learns somewhat slowly 
c. Learns fairly easily
d. Better than average
e. Learns with exceptional ease and speed


¹¹ See, for example, W. D. Scott et al., Personnel Management (New York:
McGraw-Hill, 1941), passim.


In the armed forces during World War II, rating scales were de-
vised to assist in personnel evaluation and selection in a variety of
situations.¹² These followed the usual principles and practices already
discussed. They need not, therefore, be described here.


EVALUATION OF RATING SCALES 


Rating scales are not tests; nor are they precise or objective
measures; hence, their reliability cannot be so high as that found for
other types of psychological instruments. Rating scales do provide a
means of obtaining organized descriptions of behavioral traits from
judges who have had ample opportunity to make the necessary
observations. If the scale meets the specifications already presented in
this chapter, these ratings will be based upon greater uniformity of
trait-definition and trait-connotation than will purely individual
ratings of traits defined by each judge independently and for himself.

By means of an organized scale it is possible to obtain ratings on
specified traits which are considered essential or significant in the
particular setting where the scale is being used. Completely
independent, or “unstructured,” ratings, by contrast, may fail to provide
desired information.


While high reliability of ratings among judges of the same persons
would simplify interpretations of results, moderate or even low
reliability coefficients do not discredit a scale which is soundly
conceived and developed. The reason for relatively low reliability
coefficients may be one or a combination of the following two factors.
(1) In spite of carefully defined traits and degrees thereof, ratings are
always subject to the judges’ biases, values, and standards of
performance and behavior. Some degree of [...] the judges’ own
personalities is [...]. (2) The behavior of the person being rated may be
variable in different types of situations. This is the [...]
“self-confident” in one situation is more prone to be “self-confident”
in another, [...] “withdrawn” is prone to behave [...]. [...] same
individual is not equally self-confident on all occasions; and, in fact,
there [...] characteristic mode of behavior is not readily apparent.
Variability of behavior is especially true of children and adolescents
whose personality traits are still in process of formation. In
interpreting ratings, it is necessary, therefore, to know the types of
situations in which each judge made his observations.

¹² Personnel Research and Test Development in the Bureau of Naval
Personnel, Princeton University Press, 1947, passim.

The usual criteria and standards of validity are not applicable to
rating scales. Theirs is a matter of internal or face validity. The
questions to be asked in regard to validity of a rating scale are these:
Does it meet the specifications of a sound scale? Are the traits which
are being rated by the scale significant in the setting or occupation for
which the individual is being considered? If these two questions are
answered satisfactorily, then the ultimate usefulness (i.e., predictive
validity) of the scale will depend upon the soundness (reliability) of
the judges’ ratings.
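The reliability of ratings among judges, referred to above, is commonly expressed as a correlation coefficient between two judges' ratings of the same persons. The following is a minimal sketch only, assuming the product-moment correlation is the coefficient wanted; the function name and the six pairs of ratings are hypothetical and are not taken from any published scale.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two rating lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("a judge who rates everyone alike yields no correlation")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: two judges rate the same six persons on a five-step scale
judge_a = [1, 2, 3, 4, 5, 3]
judge_b = [2, 2, 3, 5, 4, 3]
print(f"inter-judge r = {pearson_r(judge_a, judge_b):.2f}")
```

A moderate coefficient obtained in this way would, in keeping with the discussion above, call for inquiry into the judges' standards and the situations observed rather than for rejection of the scale.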


18. 




PERSONALITY INVENTORIES 


PURPOSES AND TYPES OF INVENTORIES 


It has been estimated that there are in the neighborhood of 
five hundred personality tests and inventories. Obviously, there would
be no point in attempting to describe all of them, nor even a large 
number; for they have much in common, and many of them are so 
inferior in conception and validation as to merit complete neglect. 
The inventories which are briefly presented in the following pages are 
among those that have been most widely used and are representative 
of the group as a whole. These criteria, however, by no means imply 
that they are wholly satisfactory instruments. They are more or less 
useful, within limits and in the hands of qualified psychologists. Their
limitations will be indicated in the section on evaluation. 

Rating scales, for the most part, are intended to reveal how other
persons (the judges) respond to or have been impressed by the subject;
that is, these scales provide evidence of the value placed upon
the subject in certain group situations. Personality inventories, on
the other hand, are self-rating questionnaires which deal not only
with overt behavior (e.g., insisting on having one’s own way, emotional
expression, sympathetic acts) but also with the person’s own
feelings about himself and his environment, resulting from
introspection (e.g., liking to be alone and living introvertly, need for
praise, repression of desires, caution and worry). Insofar, then, as
inventories actually get at aspects of personality which are beyond
impressions made upon observers, and resulting reputation, they are
the more valuable instruments.


Personality inventories may be classified into four types: (1) those
that assess specified traits (e.g., ascendance, conservatism,
self-confidence); (2) those that evaluate adjustment to several aspects of
the environment (e.g., home, school, community); (3) those that classify
into clinical groups (e.g., paranoiac, psychopathic personality); and
(4) those that screen subjects into two or three groups (e.g.,
psychosomatic disorders versus normal). An example of the first of these
is the Bernreuter; of the second, the California; of the third, the
Minnesota Multiphasic; of the fourth, the Cornell Index. These inventories
are among those described in this chapter.

The classification into four groups does not signify that the
inventories in each group have nothing in common with the others. The
differences between them are dependent upon purposes, organization
and nature of content, and scoring categories. Fundamentally, nearly
all personality inventories are based upon the principle that behavior
and personality are, in part, manifestations of certain traits, and that
the strength of traits can be measured.

A trait may be defined as a generalized mode of behavior or a form 
of readiness to respond with a marked degree of consistency to a set 
of situations which are functionally equivalent for the person. A trait 
is a form of adaptive or expressive behavior employed by the indi-
vidual in situations that he perceives as having some equivalence.
Thus, if a child readily volunteers information and opinions in school 
classrooms but is reticent in all other situations, his classroom be- 
havior would be regarded as a “habit” rather than as a trait. However, 
if this child’s classroom self-confidence extended into a variety of 
situations—that is, if it is generalized—his self-confidence would be 
designated as a trait. Also, if a person always votes for candidates of 
the most conservative party, this might be only a habit; but if con- 
servatism is his characteristic mode of responding to a variety of 
situations (along a scale of conservatism-radicalism), then con-
servatism is one of his traits.

Thus, in personality inventories an effort is made to estimate the
presence and strength of each specified trait through a number of items
representing a variety of situations in which the individual’s
generalized mode of responding may be sampled. The traits selected for
measurement in a particular inventory are those which are present in
varying degrees, and can be compared, among the members of the
population for whom the inventory is intended.


REPRESENTATIVE INVENTORIES 


The Allport A-S Reaction Study.¹ The Allport is a self-rating
inventory for college students and, in revised form, for business use, its
purpose being to disclose the degree to which the individual tends
to dominate others (ascendance) or be dominated by them
(submission). The inventory presents a variety of situations which are
likely to confront a person in the normal course of daily living. For
each situation, several possible answers are presented; the subject
indicates the one which most adequately characterizes his behavior. For
example:

If a student in class discussion makes a statement that you think
erroneous, do you question it?

usually
occasionally
never


The Bell Adjustment Inventory.² This consists of questions intended
to evaluate the subject’s [...] satisfaction or dissatisfaction with
home [...]; social adjustment (extent of shyness [...]); emotional
adjustment (extent of [...]). For example: “Are you troubled with
shyness?” “Are you often sorry for the things you do?” “Do you
daydream frequently?”

The Bell Inventory raises a problem which is common to all devices
of this kind: namely, do the [...] category actually represent
separate [...] and adjustment? Are these aspects mutually exclusive?
Some critics maintain they are not. They hold, on the contrary, that
the same personality variables influence adjustment in all situations
and, therefore, that the more useful and significant inventories are
those which probe the various psychological mechanisms such as
hysteria, defense (e.g., rationalization and projection) and escape
techniques (e.g., negativism, suppression), psychosomatic
manifestations, etc. Other psychologists, while recognizing the
instrument’s inability to reveal the dynamics of behavior, nevertheless
believe it is useful in placing the individual relative to a group in
respect to the specified areas of behavior, and as a basis for further
psychological interviewing. While the first criticism is warranted, the
Bell inventory has found wide and justified use for the latter purpose.

¹ By G. W. and F. H. Allport. Houghton Mifflin Co., [...] personality
inventories, the [...] devices have been largely superseded by more
recent [...]

² Stanford University Press, 1934-1938.


The Bernreuter Personality Inventory.³ This instrument is a ques-
tionnaire intended for use in grades 9 to 16, and with adults. Although
the items are not arranged into categories, they are scored for six 
traits: neurotic tendency, self-sufficiency, introversion-extroversion, 
dominance-submission, confidence, and sociability. The last two of 
these were added by J. C. Flanagan after factor analysis. The items 
themselves and the manner of answering (yes, no, ?) are not new. 
The method of scoring is, however, not typical; for each of the re- 
sponses to each item is regarded as characteristic of several traits, the 
scores for each item being weighted on the basis of empirically or 
statistically determined differentiating power. There are thus six scor-
ing scales, one for each of the specified traits.
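The scoring principle just described, in which a single response contributes a weighted amount to several trait scales at once, may be sketched as follows. This is an illustration only: the item labels, the three traits shown, and every weight are hypothetical and are not the published Bernreuter scoring keys.

```python
# Hypothetical traits and weights, for illustration only (not the published keys)
TRAITS = ("neurotic_tendency", "self_sufficiency", "dominance")

# WEIGHTS[item][response] gives one weight per trait, in the order of TRAITS
WEIGHTS = {
    "crossed_street": {"yes": (3, 1, -2), "no": (-3, -1, 2), "?": (0, 0, 0)},
    "study_motives": {"yes": (1, 2, 0), "no": (-1, -2, 0), "?": (0, 0, 0)},
    "asked_for_advice": {"yes": (-2, 0, 3), "no": (2, 0, -3), "?": (0, 0, 0)},
}


def score_traits(answers):
    """Sum, for each trait, the weights attached to the subject's responses."""
    totals = dict.fromkeys(TRAITS, 0)
    for item, response in answers.items():
        for trait, weight in zip(TRAITS, WEIGHTS[item][response]):
            totals[trait] += weight
    return totals


answers = {"crossed_street": "yes", "study_motives": "no", "asked_for_advice": "?"}
print(score_traits(answers))
```

The same answer sheet thus yields as many scores as there are scoring scales, each scale applying its own set of weights to the identical responses.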

This inventory, like others of its kind, has been criticized as having
set up arbitrarily determined personality categories and as being
inadequate for individual diagnosis. Its principal value is as an aid
in identifying persons at the extremes of the scale, as an early step
in their full psychological study. Several sample items follow: “Have
you ever crossed the street to avoid meeting a person?” “Are you
inclined to study the motives of people carefully?” “Do people ever come
to you for advice?”


The California Test of Personality.⁴ Of the questionnaire type, this
inventory is “. . . organized around the concept of life adjustment as
a balance between personal and social adjustment.” There are five
scales: primary, elementary, intermediate, secondary, and adult. The
questions, answered either “yes” or “no,” are grouped under the
following categories:

Personal adjustment: self-reliance, sense of personal worth, sense
of personal freedom, feeling of belonging, withdrawing tendencies,
nervous symptoms.

Social adjustment: social standards, social skills, anti-social
tendencies, family relations, school relations, occupation relations
(adult level only), community relations.

This broad two-fold division is consistent with a rather frequent
practice of classifying adjustment difficulties into “personality”
problems (personal adjustment) and “conduct” problems (social
adjustment) [...] at any particular focal maladjustment [...] other of
these categories; for, in fact, the whole person and his environment
are involved in behavioral difficulties or disorders. What these two
major categories and their subdivisions do is assist in identifying
some of the principal sources of an individual’s problems. This
inventory, like the Bell, [...] that may be symptomatic of
maladjustment [...]able in subsequent psychological interview and
treatment.

Several items from the intermediate inventory follow.

Do you keep on working even if the job is hard? (self-reliance)

Do you find that a good many people are mean? (sense of personal
worth)

Is it hard for you to say nice things to people when they have
done well? (social skills)

Do you often visit at the homes of your boy and girl friends in
your neighborhood? (community relations)

³ Stanford University Press, 1931-1938. See also J. C. Flanagan, Factor
Analysis in the Study of Personality (Stanford: Stanford University
Press, 1935).

⁴ By L. P. Thorpe, W. W. Clark, and E. W. Tiegs. California Test Bureau,
Los Angeles, 1953.


[...] of school children; and for [...]ably to its “babyish” quality.)
The inventory is intended to reveal psychoneurotic tendencies, the
sections purporting to rate ascendance-submission,
introversion-extroversion, and emotionality. Several specimen items
are: “I have a lot of nerve”; “I like to read before class”; “I feel
tired most of the time.” After each statement the subject indicates
whether he feels the “same” or “different.”

⁵ By R. Pintner, J. J. Loftus, G. Forlano, and B. Alster. World Book Co.,
1938. Items reproduced by special permission.

The authors of this instrument suggest that it be administered in 
class at the beginning of the school year “. . . in order to acquaint 
the teacher as soon as possible with the personality make-up of her 
children.” This suggestion is not only unwise; it is dangerous. It im- 
plies that personalities can be studied by group or mass testing, that 
this inventory is comprehensive enough to portray pupils’ person- 
alities, that personalities are to be described and diagnosed in terms 
of test indexes, and that teachers are or should be equipped to make 
personality diagnoses and provide remedies for the maladjusted. None
of these implications is warranted.


The Minnesota Personality Scale.⁶ This inventory, having one form
for men and another for women, is constructed to rate the following
aspects of personality: morale (belief in society’s institutions and
future possibilities); social adjustment (gregariousness and social
maturity); family relations (parent-child relations); emotionality
(degree of stability); economic conservatism (degree on a scale from
conservatism to radicalism). The inventory is devised for use in the last
two years of high school, with college students, and “in some adult
cases.” An aspect of this instrument infrequently found is the
gradation of answers whereby the subject indicates the strength of his
responses. Instead of the very common yes, no, or ?, the subject in this
instance has five choices, such as: strongly agree, agree, undecided,
disagree, strongly disagree; or almost always, frequently, occasionally,
rarely, almost never. The score of each item is weighted from one to
five, corresponding to the degree of intensity represented by the choice
of answer.
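The graded, five-choice scoring described above may be illustrated by a brief sketch. It assumes, hypothetically, that each item is keyed in one direction or the other, so that agreement earns five points on some items and one point on others; the direction of keying and the function names are assumptions of this illustration and are not taken from the authors' manual.

```python
AGREEMENT = ("strongly agree", "agree", "undecided", "disagree", "strongly disagree")


def item_score(choice, agree_is_high=True):
    """Weight one answer from 1 to 5 according to its position on the scale.

    agree_is_high is an assumed keying flag: when True, "strongly agree"
    earns 5; when False, the weights run in the opposite direction.
    """
    rank = AGREEMENT.index(choice)  # 0 for "strongly agree" ... 4 for "strongly disagree"
    return 5 - rank if agree_is_high else rank + 1


def scale_score(responses):
    """Total a list of (choice, agree_is_high) pairs for one scale."""
    return sum(item_score(choice, keyed) for choice, keyed in responses)


# Hypothetical answers to three items on a single scale
responses = [("agree", True), ("strongly disagree", False), ("undecided", True)]
print(scale_score(responses))
```

The middle choice receives three points under either keying, so that an undecided subject is placed at the midpoint of the item regardless of its direction.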

Perhaps the only parts of this inventory needing special mention
are the first and fifth: morale and economic conservatism. Both traits
are unusual in personality questionnaires. In the first, whereas high
scores are regarded as indicative of belief in society’s institutions and
future possibilities, low scores “usually indicate cynicism or lack of
hope in the future.” Sample items on morale are: “No one cares much
what happens to you.” “Court decisions are almost always just.” Two
items on the scale of economic conservatism-radicalism are: “On the
whole our economic system is just and wise.” “Poverty is chiefly the
result of injustice in the distribution of wealth.”

⁶ By J. G. Darley and W. J. McNamara. Psychological Corporation, 1941.

The particular selection of the five personality aspects tested by
the Minnesota inventory may appear to be a rather strange one. The
authors explain their selection as being “. . . the result of . . . work
on problems of personality measurement in a clinical personnel
program” in the University of Minnesota. The personality aspects sampled
with this instrument have been found valuable in identifying “. . . a
substantial proportion of adjustment problems in a large scale student
personnel program” after a number of traits and attitudes had been
experimentally investigated.


The Minnesota Multiphasic Personality Inventory.⁷ We have here
one of the most elaborate and ambitious instruments in this field. The
authors state in their Manual that the inventory is “[...]” It is
intended for persons sixteen [...] able to read. [...] delusions,
phobias, [...] various items of the inventory have [...] nine are:
hypochondriasis, depression, hysteria, psychopathic deviate,
masculinity-femininity interest, paranoia, psychasthenia,
schizophrenia, and hypomania. Since its original publication, scoring
has been developed for a new scale from the same 550 items. This is
called the social scale, to measure the tendency to withdraw from
social contacts with others.

⁷ By S. R. Hathaway and J. C. McKinley. Published by The Psychological
Corporation [...]

It is apparent from this list of scales that the Multiphasic inventory 
is concerned almost exclusively with the clinical problem of differential 
diagnosis. This is further indicated by the fact that the scales were de- 
veloped by contrasting normal groups with clinical psychiatric cases. 
The chief criterion of validity was the prediction of clinical cases 
against the diagnoses of a hospital staff. 

The 550 items that constitute the whole inventory are in them- 
selves not unusual. The distinguishing characteristics of the inventory 
are: its comprehensiveness; its large number of scale groupings to di- 
agnose clinical types; and four aspects of its scoring, namely a “valid- 
ity score,” a “lie score,” a “question score,” and a “K” score. 

The first is based upon a group of items which serve “. . . as a
check on the validity of the whole record. . . . If the [validity] score
is high, the other scores are likely to be invalid either because the
subject was careless or unable to comprehend the items, or because
someone made extensive errors in entering the items on the record sheet.
A low [validity] score is a reliable indication that the subject’s
responses were rational and pertinent.” This score is obtained from
sixty-four items which have been answered uniformly by about ninety
percent of normal persons and by nearly as many miscellaneous abnormal
subjects. It has been concluded, therefore, that a marked deviation
from these uniform responses is an indication of invalidity of other
responses, for reasons given above.

The “lie score” consists of fifteen items on which a high score
probably indicates that the subject has answered them falsely in order
to create a favorable impression and thereby place himself in a
socially acceptable light. In general, these are statements about one’s
behavior which, if they apply to the subject (and they do apply to
practically everyone), indicate that he is something less than perfect.
Though a high “lie score” does not necessarily invalidate the other
scores, it may indicate that some of them have been influenced
similarly; thus the total result is open to question. There are, however,
instances of individuals whose “lie scores” are not indicative of
conscious falsifying but are symptomatic of personality trends which might
need probing; for the subject himself may be unaware of the
motivational factors producing his false answers.

The “question score” is the total number of statements placed in
the Cannot Say category. The authors of this inventory state that a
high “question score” invalidates the others; for it is held that such
high scores tend to move high deviate scores toward the mean. In
other words, it appears that persons who tend to deviate from the
mean (the normal range) more often classify statements as doubtful,
whereas a correct classification would produce an even greater deviant
score and rating. Here again the subject who has a high “question
score” is not necessarily aware of the factors responsible for it.

The K score is a “correction factor” that is used to obtain increased
validity of the scales. Application and interpretation of the inventory
revealed that some normal persons got highly unfavorable scores in
certain areas; scores that should indicate abnormality. An analysis
was made of the items that were marked unfavorably by these normal
persons (called “false positives”); these items constitute the K
correction score. A low K score is said to indicate that the person was
excessively severe in evaluating himself: over-candid, very self-critical,
or exaggerating minimal symptoms. A high K score, by contrast,
represents a desire to make a more normal impression: unconscious
or preconscious defensiveness against psychological weakness, or
deliberate distortion. The authors of this inventory regard the K score
as a measure of test-taking attitude; hence, it is a score that gives
further evidence of the over-all validity of the scores on each of the
several scales.⁸

[...] findings have not been in complete agreement; and some have been
negative. The authors of this inventory [...] lack of uniformity (low
reliability) among psychiatrists themselves, who make the clinical
diagnoses and classifications.⁹ As a result of these findings, it may
be said, at least, that the MMPI is valuable in facilitating diagnosis
and in describing and predicting expected behavior.

⁸ See P. E. Meehl and S. R. Hathaway, “The K Factor as a Suppressor
Variable in the Minnesota Multiphasic Personality Inventory,” Journal of
Applied Psychology, Vol. 30, 1946, pp. 525-561.

Other published studies report significant agreement between scale 
scores and hypochondriasis, paranoia, schizophrenia, and, in particu- 
lar, depressions. Also, numerous investigations have reported that 
scale scores readily distinguish between pathological groups (undif- 
ferentiated), on the one hand, and normal persons, on the other. In 
other words, the Multiphasic inventory is more effective when its 
validity findings are not affected by the uncertainties and unreliabili-
ties of current psychiatric classifications.¹⁰
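The dependence of obtained validity upon the reliability of the criterion can be given a numerical form. The sketch below assumes the usual result of classical test theory, that an obtained validity coefficient cannot exceed the square root of the product of the test's reliability and the criterion's reliability; the function name and the figures used are hypothetical and are not reported for the Multiphasic inventory.

```python
from math import sqrt


def validity_ceiling(test_reliability, criterion_reliability):
    """Upper limit of an obtained validity coefficient under classical test theory."""
    for value in (test_reliability, criterion_reliability):
        if not 0 <= value <= 1:
            raise ValueError("reliability coefficients must lie between 0 and 1")
    return sqrt(test_reliability * criterion_reliability)


# Hypothetical figures: a reliable scale checked against an unreliable diagnosis
print(f"{validity_ceiling(0.90, 0.49):.2f}")
# The same scale checked against a more dependable criterion
print(f"{validity_ceiling(0.90, 0.81):.2f}")
```

On this reasoning a scale compared with diagnoses on which psychiatrists themselves disagree is limited in the validity it can show, whatever its own merits.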


The Cornell Index.¹¹ This inventory has been devised to provide an
instrument “. . . for the rapid psychiatric and psychosomatic
evaluation of large numbers of persons in a variety of situations.” The
index “was assembled as a series of questions referring to
neuropsychiatric and psychosomatic symptoms, which would serve as a
standardized psychiatric history and a guide to the interview, and
which, in addition, would statistically differentiate persons with
serious personal and psychosomatic disturbances from the rest of the
population. It was devised as an adjunct to the interview, not as a
substitute unless an interview is impractical.”¹² This questionnaire,
standardized for males only, consists of one hundred and one items.
The questions fall into two groups: those differentiating sharply
between persons with serious personality disturbances (for example,
“Does worrying continually get you down?”) and those concerned
with significant bodily symptoms (for example, “Do you usually have
trouble in digesting food?”). The questions are undisguised and often
extreme (e.g., “Are you keyed up and jittery every moment?” “Are
you a sleep walker?”). They must be answered either yes or no.

⁹ The reader will recall that obtained validity of any test depends in
part upon reliability of the criterion.

¹⁰ See J. C. McKinley and S. R. Hathaway, “The Identification and
Measurement of the Psychoneuroses in Medical Practice: The MMPI,” Journal,
American Medical Association, Vol. 122, 1943, pp. 161-167; H. G. Gough,
“Diagnostic Patterns on the MMPI,” Journal of Clinical Psychology, Vol. 2,
1946, pp. 23-37; P. E. Meehl, “Profile Analysis of the MMPI in Differential
Diagnosis,” Journal of Applied Psychology, Vol. 30, 1946, pp. 517-524; S. R.
Hathaway and P. E. Meehl, An Atlas for the Clinical Use of the MMPI,
University of Minnesota Press, 1951; P. L. Sullivan and G. S. Welsh, “A
Technique for Objective Configural Analysis of MMPI Profiles,” Journal of
Consulting Psychology, Vol. 16, 1952, pp. 383-388.

¹¹ By H. A. Weider, H. G. Wolff, K. Brodman, B. Mittelmann, and D.
Wechsler. The Psychological Corporation, 1944-1949.

¹² Authors’ Manual.

The authors of the Index report that it has been effective in show- 
ing the presence of anxiety states, hypochondriasis, asocial trends, 
convulsive disorders, migraine, asthma, peptic ulcers, and borderline 
clinical syndromes. It is to be noted that this inventory, unlike the 
Bernreuter, the Minnesota, and others, does not provide separate 
scoring scales and norms for specific personality traits or disorders. 
Its scores for the entire inventory are intended only to assist in dis- 
tinguishing between those having serious personality or psychoso- 
matic difficulties and those not having them. The scoring of the in- 
ventory is to be followed by an interview, or interviews, after which 
the diagnosis may be made. 

The 101 questions themselves have been classified under the
following ten categories, the number of items varying from one to
another: defects in adjustment expressed as feelings of fear and
inadequacy; pathological mood reactions, especially depression;
nervousness and anxiety; neurocirculatory psychosomatic symptoms;
pathological startle reactions; other psychosomatic symptoms;
hypochondriasis and asthenia; gastrointestinal psychosomatic symptoms;
excessive sensitivity and suspiciousness; troublesome psychopathy.
This is distinctly an instrument for clinical use, chiefly for screening
purposes and expediting diagnosis.


Two unusual features characterize the scoring of this personality
inventory: cut-off scores and “stop” scores, both based upon a total
of 1000 cases at military installations. Of these, 400 were men who
had been rejected for neuropsychiatric reasons, and 600 were men
accepted after psychiatric interview. A table of cut-off scores shows:
(1) the percentage of rejectees, at each score level, who would have
been identified by the scale, and (2) the percentage of those accepted
after psychiatric interview, but who would have been rejected by the
Index (Table 48). Thus, a cut-off score of 13 would have identified
74 percent of those rejected after psychiatric interview, as well as
having rejected 13 percent of the men passed after interview.
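The meaning of a cut-off score may be made concrete by converting the percentages into numbers of men. The sketch below is an illustration only: it transcribes a few rows of Table 48 and applies them to the 400 rejected and 600 accepted cases mentioned above; the function and variable names are hypothetical.

```python
REJECTS = 400  # men rejected for neuropsychiatric reasons
ACCEPTS = 600  # men accepted after psychiatric interview

# A few rows of Table 48: cutoff -> (% of rejects identified, % of accepts also rejected)
TABLE_48 = {7: (86, 28), 10: (81, 18), 13: (74, 13), 15: (68, 10), 20: (57, 6)}


def screening_outcome(cutoff):
    """Translate the percentages at one cutoff level into counts of men."""
    pct_rejects, pct_accepts = TABLE_48[cutoff]
    identified = round(REJECTS * pct_rejects / 100)
    falsely_rejected = round(ACCEPTS * pct_accepts / 100)
    return {
        "rejects_identified": identified,
        "rejects_missed": REJECTS - identified,
        "accepts_rejected_by_index": falsely_rejected,
    }


for cutoff in sorted(TABLE_48):
    print(cutoff, screening_outcome(cutoff))
```

Raising the cut-off spares more of the accepted men but lets more of the rejected men pass, so that the choice of level depends upon which error is the more costly in the situation at hand.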

The “stop” questions (e.g., “Were you ever a patient in a mental
hospital?”) are such as would indicate extreme maladjustment or
pathology. The “stop” items are to be used for ready identification of
men who, presumably, are to be considered immediately for rejection.


TABLE 48
Per Cent of Psychiatric Accepts and Psychiatric Rejects * Identified
as Rejects at Various Cutoff Levels of the Cornell Index

Cutoff Level    400 Psychiatric Rejects    600 Psychiatric Accepts
      0                  100%                      100%
      1                   99                        82
      2                   97                        67
      3                   94                        54
      4                   93                        46
      5                  [...]                     [...]
      6                   90                        32
      7                   86                        28
      8                   85                        24
      9                   83                        20
     10                   81                        18
     11                   78                        16
     12                   76                        15
     13                   74                        13
     14                   72                        12
     15                   68                        10
     16                   66                         9
     17                   62                         8
     18                   61                         7
     19                   60                         7
     20                   57                         6
     22                   53                         4
     23                   50                         4
     24                   48                         4
     25                  [...]                       4
     26                   42                         3
     27                   41                         3
     28                   40                         2
     29                   39                         1
     30                   35                         1
     31                   34                         1
     32                   32                         1

* In terms of opinion at psychiatric interview at five induction
stations.
Table reads: If cutoff score of 7 on Index were used, 86% of those
rejected at interview would have been rejected by Index; 28% of those
accepted at interview would also have been rejected by Index. From
Manual of Cornell Index. Psychological Corporation. (By permission.)
The efficiency of this Index in identifying poor personality risks,
as shown in Table 48, is great enough to warrant its use for the pur-
poses stated by its authors, especially in situations where large num-
bers of persons must be rapidly screened. In situations where such
pressure does not exist, the Index is still useful as a basis for and a
guide to subsequent psychological interview and to psychotherapy.
The fact that the Index does not identify larger percentages of 
probable poor risks at some levels, while, at the same time, it rejects 
a number of “psychiatric accepts,” may be due to one or both of two 
factors: (1) some inadequacy of the inventory, and (2) errors in 
psychiatric judgments leading to acceptance or rejection after psy- 
chiatric interview. It is widely recognized that the brief psychiatric 
screening interviews during World War II were not optimally con- 
ducted; and the psychiatric interviewers, in many instances, were very 
inadequately prepared for their tasks. 
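
The trade-off described above can be worked through with the group sizes given in the heading of Table 48 (400 psychiatric rejects, 600 psychiatric accepts). The following is only an illustrative sketch: the function name and the selection of rows are assumptions made for the example, and the 400/600 split describes this sample, not the proportion of poor risks in any other population.

```python
# Group sizes taken from the column headings of Table 48.
N_REJECTS = 400
N_ACCEPTS = 600

# A few clearly legible rows of Table 48:
# cutoff -> (% of rejects flagged by Index, % of accepts flagged by Index)
TABLE_48 = {
    7: (86, 28),
    10: (81, 18),
    20: (57, 6),
    30: (35, 1),
}


def screening_summary(cutoff):
    """Hypothetical helper: counts of men flagged by the Index at one cutoff."""
    pct_rejects, pct_accepts = TABLE_48[cutoff]
    flagged_rejects = N_REJECTS * pct_rejects / 100
    flagged_accepts = N_ACCEPTS * pct_accepts / 100
    flagged_total = flagged_rejects + flagged_accepts
    return {
        "flagged_rejects": flagged_rejects,
        "flagged_accepts": flagged_accepts,
        # Of all men flagged, the share whom the interview also rejected.
        "share_of_flagged_who_are_rejects": flagged_rejects / flagged_total,
    }


if __name__ == "__main__":
    for cutoff in sorted(TABLE_48):
        print(cutoff, screening_summary(cutoff))
```

Raising the cutoff flags fewer of the accepted men, but it also misses more of the men rejected at interview; the choice of cutoff therefore depends on which error is the more costly in a given situation.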


The Guilford-Zimmerman Temperament Survey. This is one of 
the most carefully developed inventories available. It presents rather 
full statistical data and psychological rationale, thus enabling the 
prospective user to evaluate the instrument for his needs and to inter- 
pret the results more reliably. 

Ten traits are measured: general activity, restraint, ascendance, 
sociability, emotional stability, objectivity, friendliness, thoughtful- 
ness, personal relations, and masculinity. The inventory is intended 
for use with individuals in grades 9 through 16 and with adults. The
particular traits included are the products of factorial analyses made 
over a period of years by Guilford and his associates. 

The statements of these authors regarding the values and charac-
teristics of their inventory are much more restrained and psychologi-
cally cautious than those of some other authors of inventories.


The SRA Youth Inventory.14 Differing from other inventories, this
one contains a check-list of 298 items, subdivided into nine classes:
“my school, looking ahead, about myself, getting along with others,
my home and family, boy meets girl, health, things in general, basic
difficulty.” It is obvious from this list that the inventory is not devised
to measure personality traits (except as they might be inferred from
the scores). The check-list of items is intended to identify areas of
problems that are worrying junior and senior high school pupils most.
The obtained results, then, in conjunction with other information
about each individual, are to be used in counseling and interview,
perhaps also in instances where therapy is indicated.


13 By J. P. Guilford and W. S. Zimmerman. Sheridan Supply Co., Beverly
Hills, California, 1949.

14 By H. H. Remmers, B. Shimberg, and A. J. Drucker. Science Research
Associates, 1950.

Original materials for the inventory were obtained by having high- 
school pupils write spontaneous anonymous essays on problems of 
greatest concern to themselves. The contents of these were analyzed; 
earlier investigations and inquiries regarding teen-age problems were 
surveyed; and on the basis of these analyses and surveys, a pre- 
liminary list of items was prepared. These items were edited by 
specialists in psychology and in education, thus deriving the final list
of items from direct introspection by the persons concerned, from 
comparative studies, and from professional evaluations. 

The manual of this inventory provides full statistical data essential 
to the interpretation of scores. Equally important is the authors’ em- 
phasis upon interpreting scores in conjunction with other relevant 
information about each individual. When all relevant information 
about a particular adolescent is available, it is then possible to esti- 
mate the significance of his responses and to differentiate between 
problems which are relatively superficial and those which suggest
basic personality difficulties.15


The Security-Insecurity Inventory.16 This is another instrument
that differs from most others in that it is devised to assess only one
pair of opposed personality traits. The authors of this inventory have
selected these particular traits because they believe that security “is
almost synonymous with mental health.” Security is defined, essen-
tially, as feelings of being liked, loved, and accepted; of belonging
and having a place in the group; of safety and of being unanxious.

The S-I Inventory is intended for use with groups for research and


15 A similar type of device is the Mooney Problem Check List (1950),
published by the Psychological Corporation. It does not yield any scores, being
intended to facilitate and expedite psychological interview and counseling.

16 By A. H. Maslow et al. Stanford University Press, 1952.



TABLE 49

Information on Reliability and Validity of Certain Inventories

Inventory

Allport A-S Reaction Study
Bell Adjustment Inventory
Bernreuter Personality Inventory
California Test of Personality
Aspects of Personality
Minnesota Personality Scale
MMPI
Cornell Index

52-9

Reliability

.85-.90 (split-half)
.78 (retest)
.75-.97 (retest)
.80-.89 (odd-even)
.78-.92 (split-half)
.51-.97 (for part-scores)
.80-.96 (for total scores) (retest)
N
.90 (odd-even)
.56-.90 (retest; normal subjects)
.52-.89 (retest; psychiatric patients)
.95 (Kuder-Richardson formula)

Validity Criteria

occupational groups
ratings of judges
other inventories
ratings of judges
item analysis
differentiation of extreme groups
other inventories
differentiation of extreme groups
low intercorrelations of part-scores
later clinical findings
experts’ judgments of item appropriateness
item analysis
item analysis
behavior ratings
item analysis
extreme groups
known groups of adjusted and maladjusted students
diagnosed psychiatric groups
normal subjects
amount of score overlap between nosological groups
differentiation between unselected patients and normal persons
neuropsychiatric cases: after interview
normal persons, accepted after interview
civilian groups
other inventories
distribution of scores in a college population



TABLE 49 (Continued)
Information on Reliability and Validity of Certain Inventories

Inventory                Reliability         Validity Criteria

Guilford-Zimmerman       .75-.85             low correlations between traits
Temperament Survey                           factor analysis to isolate traits

SRA Youth Inventory      [?]85               spontaneous productions of subjects
                                             (teen-agers)
                                             biserial correlations
                                             intercorrelations of part-scores
                                             professional judgments of diagnostic
                                             value of each item
                                             scores of known groups

Security-Insecurity      .8[?] (retest)      other inventories
Inventory                .86 (odd-even)      self-estimate of subjects
                                             known groups
                                             systematic analysis of syndromes,
                                             security and insecurity

survey purposes, and for the screening of college students who may
be in need of psychological counseling or of therapy. 


Summary Data. The foregoing instruments were described as rep- 
resentatives of many in the same category. They were presented, as 
well, for the purpose of providing the reader with a conception of 
types of items included and of the personality traits being evaluated. 
Of the personality inventories developed specifically for use in the 
armed forces during World War II, no specimens or descriptions have 
been given here because in essentials of conception, construction, and
content they are not different from those in civilian use before the war.
In fact, many previously published tests were employed in the armed
forces.17 Results reported on the effectiveness of these inventories in
the armed forces indicate somewhat greater validity and usefulness 
in some instances than do similar data for inventories used in civilian 


17 See J. P. Guilford, editor, Printed Classification Tests, Report No. 5 (U. S.
Government Printing Office, Washington, 1947), Chapters 22-27.




situations. The principal reasons for this seem to be greater motiva- 
tion, screened populations, knowledge of both test results and of 
the subjects on the part of the judges, and specially devised inven- 
tories for specific situations. 

Table 49 presents in outline form information relevant to the
reliability and validity of each of the inventories described in this
chapter.


BIOGRAPHICAL DATA QUESTIONNAIRES 


With this type of inventory, the purpose is to sample the sev-
eral areas of a person’s experiences that appear to be associated, di-
rectly or indirectly, with behavior and success in a particular occupa-
tion or situation. These aspects may include the intellectual as well
as the nonintellectual: for example, education, religious activities,
occupations, interest in athletics, marital record, financial status and
interests, specific skills, group affiliations, family relationships, etc.
The development of a biographical questionnaire involves a job
analysis and experimental determination of biographical areas that
are most significant.

Items of a questionnaire are intended to reveal the subject’s atti-
tudes and adjustments in terms of significant situations in which he
has participated. This approach is not an attempt to measure per-
sonality traits. Its major premise is that past behavior, interests, ac-
tivities, habits, skills, attitudes, etc., are indicators of what may be
expected in the future. Actually, biographical data questionnaires are
variations on the familiar application blank, including a wide range
of information, with answers to be given in multiple-choice form.

During World War II, this technique was employed in the armed
forces in an effort to improve the selection of personnel for various
training assignments.18 The results were [...]


18 See D. B. Stuit, op. cit., passim; and J. P. Guilford (ed.), Printed Classifi-
cation Tests, Chapter 27.




A. Manufacturing industries (machine operator, factory hand,
   textile worker, etc.)
B. Technical trades (baker, electrician, radio repairman, etc.)
C. Transportation and communication (truck driver, linesman,
   deckhand, etc.)
D. Business trades (store clerk, salesman, agent, window dresser,
   etc.)
E. Public service (fireman, policeman, forest ranger, soldier, etc.)

A recent attempt to devise a biographical questionnaire for selec-
tion of educational administrators sampled information in the fol-
lowing areas: childhood and early background; professional prepara-
tion; health; interests; early signs of leadership; heterogeneous items.19


Two sample items follow:

During most of the time before you were 16 you lived:

A. with both parents
B. with one parent
C. with a relative
D. with foster-parents or non-relatives
E. in a home or institution

How often were you a leader of your childhood “gang” activities
up to the age of 12 years?

A. always
B. frequently
C. occasionally
D. seldom or never
E. never a member of a gang; or can’t remember.


While this investigation proved of little value for the selection of 
school administrators, it is reported here as representative of the ap-
proach to the problem and of the early stages of a technique which
may prove fruitful for purely practical purposes. 

The biographical data method has within it a very serious potential 
defect not found in other types of inventories. It is a purely empirical 
procedure, in which an attempt is made to find the predictive value
of each biographical datum with respect to “success” in a specific
position or in a type of occupation. Since this type of questionnaire
does not deal with abilities, skills, and basic personality traits, and 


19 J. P. Guilford and A. L. Comrey, “Prediction of Proficiency of Administra-
tive Personnel from Personal-History Data,” Educational and Psychological
Measurement, Vol. 8, 1948, pp. 281-296.




since it is validated against job performance ratings and against
“progress” of individuals in an organization, the predictive signifi-
cance of biographical data is very largely a matter of the standards,
values, and biases of raters or of the particular organization. For
example, it may be found that being “rural-born” (rather than “city-
born”) has a fairly heavy positive weight in predicting progress to a
school superintendency, or to an executive post in industry. This
situation would indicate nothing about the subject’s abilities for the
position nor of his personality traits, nor of the actual demands made
by the position itself. Such a finding would, however, assist in an
analysis of the values and biases of prospective employers.


TESTS OF ATTITUDES AND VALUES 


Characteristics. An attitude is a dispositional readiness to
respond to certain situations, persons, objects, or ideas in a consistent
manner which has been learned and h[...]
of response. An attitude has a well-de[...]
example, one’s views regarding [...]
[...]iquors), sports, mathematics, or Democrats are attitudes. If, how-
[...]avior is described as self-sacrificing, [...]
[...]site of these, some of his traits are [...]
[...]umerable attitudes. [...]
Basically, tests of atti[...]
[...]loyed by Thurstone fo[...]

Methods of Scaling. Thurstone’s technique of scaling attitude tests
is known as the method of equal-appearing intervals. The method is


20 L. L. Thurstone, Scales for the Measurement of Social Attitude, Chicago:
University of Chicago Press, 1930; “The Measurement of Social Attitudes,”
Journal of Abnormal and Social Psychology, Vol. 26, 1931, pp. 249-269.




essentially this. Statements, both favorable and unfavorable, bearing 
on a particular problem, question, or institution are obtained from a 
group of selected writers, other experts, and laymen. These statements 
are edited. Then they are classified by a large number of judges on 
an eleven-point scale. This is done by placing each statement in one 
of eleven piles, presumably forming a continuum, according to degree 
of favorableness or unfavorableness of each item with respect to the 
question at hand. The median of the judged locations for a statement 
(item) is its scale value. Statements which are judged to be ambiguous 
or irrelevant to the continuum are eliminated.

Before inclusion in the final scale, each question is analyzed for 
consistency with the general attitudes found by the total scale. For 
example, on a scale to determine attitudes toward churches, if it is 
found that many persons having an unfavorable attitude check a 
statement that is apparently favorable, then that item is considered
irrelevant and is discarded. Statements having approximately the
same values in the scale should show high consistency in degree of
endorsement by each subject. This is essentially a simple method of
item analysis. Ambiguity of an item is determined by the spread or
range of judges’ ratings in the original eleven-fold scale, given in terms
of Q (quartile deviation). If an item’s Q is “high,” it is eliminated. 
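
The two statistics named above, the median of the judges' placements (the scale value) and Q (the quartile deviation), can be sketched as follows. This is a simplified illustration and not Thurstone's own computing procedure: the function names are hypothetical, the quartiles are taken with Python's `statistics.quantiles` and not by graphical or grouped interpolation, and the limit `max_q` is an assumed figure, since the text says only that an item with a "high" Q is eliminated.

```python
import statistics


def scale_value_and_q(placements):
    """Return (scale value, Q) for one statement.

    placements: the pile numbers (1 to 11) into which the judges
    sorted this statement.
    """
    q1, median, q3 = statistics.quantiles(placements, n=4, method="inclusive")
    # Q is half the distance between the first and third quartiles.
    return median, (q3 - q1) / 2


def keep_statement(placements, max_q):
    """Hypothetical rule: retain the statement only if Q does not exceed max_q."""
    _, q = scale_value_and_q(placements)
    return q <= max_q


if __name__ == "__main__":
    agreed = [2, 2, 3, 3, 3, 4, 4]        # judges largely agree
    scattered = [1, 3, 5, 6, 8, 10, 11]   # judges disagree widely
    print(scale_value_and_q(agreed))
    print(scale_value_and_q(scattered))
    print(keep_statement(scattered, max_q=1.5))
```

A statement on which the judges scatter widely receives a large Q and would be dropped as ambiguous, whatever its median happens to be.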

In taking an attitude test scaled in this manner, the respondent 
checks those statements with which he agrees, his score being the 
median of the scale values of the items he has marked. Thurstone
held that scales constructed for different attitudes by this method
permit direct comparison of the scores of any attitudes so measured.
The validity of such comparison, however, has been questioned be-
cause the defined “neutral points” of different attitudes are not neces-
sarily the same. Nor are the intervals demonstrably equal; they are
only equal appearing. The Thurstone method is useful if strict com-
parability of scores is not assumed.

Thurstone and his students developed a series of scales, each 
consisting of statements from extremely favorable to extremely un-
favorable. The topics included in these scales dealt with attitudes,
among others, toward Negroes, Chinese, war, censorship, the Bible,
patriotism, and freedom of speech. The following statements are from
the scale of attitudes toward the church.21 The scale value of each


21 L. L. Thurstone and E. J. Chave, The Measurement of Attitude, Chicago:
University of Chicago Press, 1929.




statement is given in parentheses, low values being favorable and high 
values unfavorable, with a possible range from 0 to 11. 


I find the services of the church both restful and inspiring. (2.3)

I think the church is a parasite on society. (11)

I believe what the church teaches but with mental reservations. (4.5)

I think the teaching of the church is altogether too superficial to
have much social significance. (8.3)

I believe in religion but I seldom go to church. (5.4)

I believe the church is the greatest institution in America today. (1.7)
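
The scoring rule for a scale of this kind, the median of the scale values of the statements the respondent checks, may be illustrated with the six statements above. This is a sketch only; the short keys and the function name are hypothetical labels introduced for the example, while the scale values are those given in parentheses.

```python
import statistics

# Scale values of the six sample statements (low = favorable to the church).
# The keys are shorthand labels invented for this sketch.
CHURCH_SCALE = {
    "restful_inspiring": 2.3,
    "parasite": 11,
    "mental_reservations": 4.5,
    "too_superficial": 8.3,
    "seldom_go": 5.4,
    "greatest_institution": 1.7,
}


def thurstone_score(endorsed, scale_values=CHURCH_SCALE):
    """Score = median scale value of the statements the respondent checked."""
    return statistics.median(scale_values[key] for key in endorsed)


if __name__ == "__main__":
    favorable = ["restful_inspiring", "mental_reservations", "greatest_institution"]
    hostile = ["parasite", "too_superficial", "seldom_go"]
    print(thurstone_score(favorable))
    print(thurstone_score(hostile))
```

A respondent who checks the three favorable statements receives a low score, and one who checks the three unfavorable statements a high score, in keeping with the direction of the scale.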


Likert suggested the use of an attitude-scoring technique which is
simpler than the Thurstone method and is [...]
[...] agree, agree, undecided, disagree, or strongly disag[...]
[...]disapprove may be used in place of agree-disagree). Arbitrary scoring
weights of 1, 2, 3, 4, 5 were assigned for the respective responses. An
individual’s score on a particular attitude scale is the sum of his ratings
on all items. The principal advantage of Likert’s method, obviously,
is that it makes unnecessary the use of a group of judges to arrange
statements into categories representing degrees of favorableness or
[...]s are selected on an a priori
[...]e scoring weights are arbitrarily assigned, the use
of the Likert method, like the Thurstone, measures attitudes only in
the sense that individuals are given a rank order according to attitude
intensity.22
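
The summated scoring just described can be sketched in a few lines. The weights 1 to 5 for the five responses follow the text; the item labels, the function name, and the optional reversal of unfavorably worded items are assumptions added for illustration and are not taken from Likert's monograph.

```python
# Arbitrary weights for the five responses, in the order given in the text.
WEIGHTS = {"SA": 1, "A": 2, "U": 3, "D": 4, "SD": 5}


def likert_score(responses, reversed_items=()):
    """Sum the weights of one person's responses.

    responses: dict mapping an item label to "SA", "A", "U", "D", or "SD".
    reversed_items: labels of items worded in the opposite direction
    (an assumption of this sketch), whose weights are reflected (1<->5, 2<->4).
    """
    total = 0
    for item, answer in responses.items():
        weight = WEIGHTS[answer]
        if item in reversed_items:
            weight = 6 - weight
        total += weight
    return total


if __name__ == "__main__":
    answers = {"item_1": "SA", "item_2": "D", "item_3": "U"}
    print(likert_score(answers))
    print(likert_score(answers, reversed_items={"item_2"}))
```

Because the weights are arbitrary, the total serves only to rank individuals with respect to one another, which is the limitation noted in the paragraph above.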

The following items from the Minnesota Personality Scale (for
men) are examples of the technique suggested by Likert.

SA, A, U, D, SD    On the whole lawyers are honest.
                   The future looks very black.
                   Education only [...]

22 See R. Likert, A Technique for the Measurement of Attitudes, Archives
of Psychology, Vol. 22, No. 140, 1932.




persons, or institutions. The scales deal, among others, with national
and racial groups, vocations, teachers, social action, and school
subjects.23


Tests of Values. A test of values, by contrast with one of attitudes,
purports to measure generalized and dominant interests. The Study
of Values,24 for example, is based upon six types of values, as classified
by Spranger. Its items are intended to measure the relative promi-
nence of the subject’s interests, for the purpose of classifying his
values. The six types are: theoretical, economic, aesthetic, social,
political, and religious. According to this classification, the dominant
interest of the theoretical man is discovery of truth; the economic man
is interested in what is useful; the aesthetic man values most form
and harmony; the highest value of the social type is love of people;
the political man is interested primarily in power; and the religious
man places highest the value of unity in an effort to comprehend the
cosmos as a whole. This test of values presents forty-five problem
situations, under each of which the subject is required to select (from
paired alternatives or from multiple choices) responses which are
indicative of degrees of the six types of interests and values. For
example:
The main object of scientific research should be the discovery of
truth rather than its practical applications. (a) Yes; (b) No.

Do you think that a good government should aim chiefly at: [The
following statements are to be ranked in order of preference.]

(a) more aid for the poor, sick, and old
(b) the development of manufacturing and trade
(c) introducing highest ethical principles into its policies and di-
    plomacy
(d) establishing a position of prestige and respect among nations


It is not to be assumed that these six are “natural” types, or that
they include all possible value groups, or that individuals can be
classified entirely under one or another. As a matter of fact, most
persons are a mixture of two or more of these value groups, some
values being stronger and more dominant than others in each person.

The six classifications were employed as starting points for the in-
vestigation of complex views of life which, among others, serve to
give unity and purposefulness to the mature person.


23 H. H. Remmers, Attitude Scales, Division of Educational Reference,
Lafayette, Indiana: Purdue University, 1934-1936.

24 By G. W. Allport, P. E. Vernon, and G. Lindzey. Boston: Houghton
Mifflin, 1931-1951.

An interesting and different approach to the study of values of 
high-school and college students and adults is found in the Sims SCI 
Occupational Rating Scale.25 This scale is devised “to reveal the level
in our social structure—i.e., the social class—with which a person 
unconsciously identifies himself.” The scale lists forty-two occupa- 
tions, representing all levels of socio-economic status. The subject 
indicates whether persons following each of the occupations generally 
belong to the same, or to a higher, or to a lower social class than he 


does himself. Sims states that “. . . by examining the occupations
which the subject indicates are those [...]
own social class, we are able to determi[...]
[...]signs himself in our society.”

[...] economic group usually signifies acceptance of t[...]
that group.


as reliability is con- 
ian reliability coeffi- 
and even below .50, 
, the coefficients were 
scoring (Thurstone’s 
ated (about 90), as 
esponses were scored 


Validity of tests of attitudes and values is extremely difficult to 
determine in a statistical manner, since the only behavioral criterion 
is people’s actions. Obviously, it is practically impossible to obtain
objective behavioral data on a population sampling with regard to
such matters as attitudes toward church, foreign-born, specific minor-
ity groups, education, and the like.27 Furthermore, overt behavior need

25 By V. M. Sims. World Book Co., 1932. SCI stands for “social class
identification.”

26 A. L. Edwards and K. C. Kenney, “A Comparison of the Thurstone and
Likert Techniques of Attitude Scale Construction,” Journal of Applied Psy-
chology, Vol. 30, 1946, pp. 72-83.

27 Cf. S. M. Corey, “Professed Attitudes and Actual Behavior,” Journal of
Educational Psychology, Vol. 28, 1937, pp. 271-280; R. T. LaPiere, “Attitudes
vs. Actions,” Social Forces, Vol. 13, 1934, pp. 230-237.




not always be correlated with attitude scores. For example, consider 
two persons, one of whom is extremely hostile to churches while the 
other is indifferent to them. Their scores on the scale will differ sig- 
nificantly; but the indifferent person might attend and support a 
church as little as the hostile one. On the other hand, another indif- 
ferent person might attend regularly for social or economic reasons. 
Similarity of manifest behavior may be true, also, of one who is 
hostile to the foreign-born and one who merely avoids them. The fact 
that attitudes and overt behavior need not correspond makes valida- 
tion, in the usual terms, a near impossibility. It is reasonable to con- 
clude, however, that if individuals make a genuine effort to respond 
according to their own attitudes, these scales are useful in evaluating 
the beliefs of the respondents, as of the time the responses are given. 
Whether retest scores will show significant changes will depend upon 
each respondent's experiences, including educational, in the interim. 

Scales of attitudes and values have had little use in clinics or in 
educational testing programs. They have been utilized principally 
in investigations in social and educational psychology. 


OPINION POLLING 

In recent years opinion gauging has become in itself a spe-
cialized field of study and practice. The most prominent activities
have been in making political predictions, especially regarding presi-
dential elections. But opinion polling has been concerned also with a
great variety of subjects dealing with social, economic, international,
military, and other questions, and also with questions of consumer
preferences (usually called market research).

While some opinion studies use several questions in surveying a
given issue, many employ only a single question. The question may
be given in one of several forms: some require merely a “yes” or
“no” answer; some require a rating of intensity or degree, such as
“strongly approve,” “approve,” etc., or “very much,” “much,” etc.;
at times the respondent is asked to check or rank items in a given
list; sometimes the respondent chooses one of two alternatives; oc-
casionally the question is of the “open-end” type, in which the
respondent completes a statement or sentence to suit himself.

All replies are classified into two or more categories, intended to 
answer the purpose for which the survey is planned. Usually the per- 
centage of responses in each category is found. The responses in the 




several categories are regarded by most opinion gaugers as represent- 
ing quantitative or qualitative differences in attitude with respect to 
the question being investigated. 

The mailed questionnaire, which had been in use long before 
opinion polling became popular, is another form of opinion gauging. 
This form of questioning, however, presents several serious disadvan- 
tages, so that it is not widely used at present. Representative mailing 
lists are difficult to obtain or develop; the percentage answering the 
questionnaires is often small and atypical of the total group; the
questions may not be understood or correctly interpreted by persons
at the lower end of the intelligence scale.


Opinions are obtained from a sampling of persons who are regarded
[...]ulation. This population may be
[...]: for example, college students, or
[...], or housewives, or all the voters
[...]

[...]ulation to be polled has been de-
fined, it may be sampled by one of three most common methods: (1)
random sampling; (2) stratified sampling; (3) area sampling.
Random sampling simply means selecting respondents by “chance,”
with the expectation that each individual in the defined group has an
equal chance of being chosen. Hence, it was assumed that random
selection of a large enough number of individuals would yield a
representative sampling of the entire group. It has been shown, how-
ever, that very often completely random selection does not yield a
representative group. Precautions must be taken to avoid obtaining
a biased sampling; and the investig[...]
obtain a random sampling within t[...]
[...]ing does not insure that a repre-
[...]ained. Consequently, purely
random sampling is infrequently used at present.

When stratified sampling is used, the population to be studied is
classified into two or more categories (or strata), from each of which
a sample of proper proportions is drawn. For example, it might be
desired to divide the population into two groups: high-school grad-
uates and non-high school graduates. The proportions of each in the
general population would have to be determined: then the same pro-
[...]28 from each group in the investigation.
[...]esired, such as completion of six grades,


28 In some instances, for well-defined reasons, disproportional representation
is warranted.




and nine, and twelve, and college graduation. In other situations, it 
might be necessary to obtain respondents who satisfy specified pro- 
portions with regard to sex, age, economic status, native or foreign- 
born, etc., or combinations of two or more of these. It is obvious that 
the problem of stratified sampling becomes more complex and diffi- 
cult as the number of characteristics is increased. The purpose of 
the investigation determines the characteristics upon which the sam- 
pling shall be “stratified.” The population to be studied is then 
analyzed for these characteristics and sample respondents are drawn 
accordingly. 
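
Proportional stratified sampling, as described above, can be sketched as follows. The sketch assumes that every member of the population can be assigned to exactly one stratum; the function names, the largest-remainder rule for rounding, and the figures in the example are illustrative assumptions and not a procedure prescribed by any polling organization.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size among strata in proportion to their sizes.

    Fractions are resolved by giving the left-over places to the strata
    with the largest remainders (an assumed rounding rule).
    """
    total = sum(stratum_sizes.values())
    exact = {k: sample_size * n / total for k, n in stratum_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(population, stratum_of, sample_size, seed=None):
    """Draw a random sample within each stratum, in proportion to its size."""
    rng = random.Random(seed)
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sizes = {k: len(members) for k, members in strata.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {k: rng.sample(members, alloc[k]) for k, members in strata.items()}


if __name__ == "__main__":
    # Hypothetical population: 400 high-school graduates, 600 non-graduates.
    people = [(i, "graduate") for i in range(400)]
    people += [(i, "non_graduate") for i in range(400, 1000)]
    sample = stratified_sample(people, lambda p: p[1], sample_size=50, seed=1)
    print({k: len(v) for k, v in sample.items()})
```

With two or more characteristics (education and sex, for example) each combination becomes a stratum, so the number of strata, and with it the difficulty of filling each in the proper proportion, multiplies quickly.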

Using the method of area sampling, the geographic area to be 
investigated is divided into a number of subdivisions of approximately 
equal population. The characteristics of the population in the various 
areas are known, so that in the final selection, individuals of the de- 
sired characteristics will be included in correct proportions. The 
particular divisions to be studied are chosen so as to yield a sampling
that will be as large and diversified as may be necessary for the in- 
vestigation. Each of the divisions thus included is further subdivided 
into much smaller units, within each of which are only a few persons 
who have the desired characteristics of those to be polled by interview. 
Then from among these small subdivisions, a random sampling is 
made for actual study. Within each of these final units, chosen by
means of random numbers, every person is interviewed, even if
repeated visits to the homes are necessary.
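
The last two steps of area sampling, random choice among the small final units and interviewing of every person within each chosen unit, can be sketched as follows. The sketch assumes the earlier stages (division of the area and subdivision into small units) have already been carried out; the unit labels and the function name are hypothetical.

```python
import random


def area_sample(units, n_units, seed=None):
    """Choose n_units final units at random; everyone in a chosen unit is polled.

    units: dict mapping a unit label to the list of persons living in it.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(units), n_units)
    return {unit: list(units[unit]) for unit in chosen}


if __name__ == "__main__":
    # Hypothetical area: ten small units of three persons each.
    units = {f"unit_{i}": [f"person_{i}_{j}" for j in range(3)] for i in range(10)}
    selected = area_sample(units, n_units=4, seed=0)
    for unit, persons in selected.items():
        print(unit, persons)
```

Chance enters here in the choice of units and not in the choice of persons, which is why no one within a selected unit may be skipped.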

The area sampling method is infrequently used because it is so 
expensive; it requires that relevant information be obtained regarding 
every family within specified areas and that extremely elaborate files 
be kept. It is very doubtful whether the advantages of the area sam-
pling method are such as to warrant the cost and the attempts to
obtain information of kinds that many people will, with justification,
regard as an invasion of their privacy.

Polling procedures and results are too often of doubtful reliability 
and validity because of difficulties inherent in the practice itself and
because some of the problems have been neglected or inadequately
dealt with. The major problems and difficulties will be indicated with-
out elaboration.


With any method of sampling, it is extremely difficult to get a com-
pletely unbiased sampling because of unknown chance errors, un-
known selective factors, and errors of judgment in evaluating traits
and responses of some persons.




It is doubtful if opinions can be correctly gauged by a single ques- 
tion, except in special instances such as elections, when the respond- 
ent makes a choice between candidates. 

Answers to questions may be deliberately falsified or there may be 
lack of frankness which results in a large number of “undecided” or 
neutral answers. 

All respondents do not necessarily interpret a question or statement
in the same way; the same question or statement has varying con-
notations for different persons.

Some respondents do not understand the language or phrasing of
the question or statement.

Some respondents do not know or understand the issue being dealt
with.

Individuals who do not actually have an opinion feel at times con-
strained to express one anyhow.

Over-all percentages of responses vary with different ways of stat-
ing a question.

When the open-end question is used, it is difficult to classify the re-
sponses.

Responses are influenced by the training, influence, and possible
bias of the interviewer.

Significant differences in status between interviewer and respond-
ents may influence answers: e.g., responses of Negroes to Negro
interviewers may differ from those of Negroes to white interviewers.

Verbal expression of an opinion does not necessarily indicate the
respondent’s actions.

Different persons may have different reasons for giving the same
responses; and persons giving different responses may do so for a
common reason. Psychologically, it is necessary to determine,
through skillful interview, the reasons for a response.


Some of these difficulties and problems can be met in part if the
questions are stated unambiguously and simply and are easily under-
stood by persons for whom intended; if the respondents are familiar
with the issue; if the respondents could reply by “secret ballot” or
could surely remain anonymous; if more than one question is used for
a given issue.


As psychologists, however, [...]
why individuals hold certain [...]
polled, rather than to learn [...]




States? Why are some individuals hostile to persons of Oriental origin? 
Determination of reasons for these and even more subtle behaviors, 
attitudes, and values requires interviewing by qualified psychologists 
or other qualified professional persons. So long as opinion polling 
remains a matter of classification of responses and determination of 
percentages, it remains a statistical problem of significance, aside 
from technical considerations, principally to political scientists,
sociologists, and consumer research specialists. 29


EVALUATION OF PERSONALITY INVENTORIES 


Reliability and Validity. The authors of most of these instruments
present reliability coefficients which are reasonably satisfactory,
or even quite high, being for the greater part .80 or more. In some
instances the reliabilities falling below .80 or even below .70 are
unsatisfactory so far as predictions of subsequent behavior in
individual cases are concerned. The methods used are the usual ones:
namely, correlation of odd-even scores and of test-retest scores.
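
The two usual estimates named above can be sketched in modern terms. What follows is a minimal illustration in Python, not the procedure of any particular test author. The function names and the sample answers are hypothetical, and the Spearman-Brown step-up applied to the odd-even correlation is an assumption added here, since the text names only the correlation itself.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    ss_x = sum(d * d for d in dx)
    ss_y = sum(d * d for d in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Step up a half-test correlation to an estimate for the full-length test.

    Assumption: this correction is not named in the text.
    """
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """item_scores holds one list of item scores per person."""
    odd_totals = [sum(person[0::2]) for person in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


def retest_reliability(first_scores, second_scores):
    """Correlation between scores on two administrations of one inventory."""
    return pearson_r(first_scores, second_scores)


# Hypothetical answers: four persons, six items scored 0 or 1.
answers = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(odd_even_reliability(answers), 2))
print(round(retest_reliability([10, 12, 15, 20], [11, 14, 15, 19]), 2))
```

A coefficient obtained in this way describes the consistency of the scores only; it gives no assurance that the inventory measures the trait it names.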

It is in respect to validity, however, that personality inventories as
a class present the greatest difficulties and are most vulnerable to
criticism. Determination of validity is certainly difficult; yet that
must be the most essential requisite of a useful instrument.

In devising their personality measures, the earlier authors began
with questions or statements, gathered from a variety of publications
and sources (clinics, schools, colleges, industry and business, home,
community), that are symptomatic of neurotic disorders, behavior
difficulties, or of normal behavior manifestations. The nature and
scope of the items depend, of course, upon the age-levels and
purposes for which a particular inventory is intended. Authors of
later instruments followed the same practice, often utilizing items
from several earlier devices, recombining them, and adding some new
ones.

Methods. The following methods and criteria have commonly been used
in studying validity: (1) statistically significant differences
between average scores of clinically well-defined groups; (2)
significant average score differences between clinical groups and a
normal population; (3) ability of each item to differentiate between
the two extreme groups in the standardization population; (4)
internal consistency of items or parts; (5) comparisons of inventory
scores with judgments of counselors and school officials; (6)
selection of items from other published tests and correlations with
these instruments; (7) factorial analysis; (8) the author’s own
judgment regarding manifestations which constitute evidence of a
specified trait.

29 For a comprehensive critique, see Q. McNemar, “Opinion-Attitude
Methodology,” Psychological Bulletin, Vol. 43, 1946, pp. 289-374.

(1) The first criterion should be employed only when an inventory
is designed primarily for clinical use in the diagnosis of personality
disorders; as, for example, in the case of the Minnesota Multiphasic
Personality Inventory (depression, hysteria, paranoia, etc.).
Measures standardized on this criterion cannot and should not be used
for the study of a normal population, except for the purpose of
screening out, for further clinical study, individuals at the extremes
of maladjustment.


(2) The second criterion is likewise used with measures that are
intended chiefly for clinical purposes, but in this instance the
emphasis [...]

[...] extreme groups of a distribution [...]al-reactionary or
ascendance-submission), as shown by the percentage at each extreme
answering the item in a specified manner. A test so validated should
not be [...] representative population because it is not necessarily
adequate to differentiate among the great percentage of persons who
are located between the extremes.
(4) The fourth, internal consistency, differs from the preceding
criterion in that each item is correlated against part scores for all
subjects, the purpose being to learn whether answers to the
individual items are, on the whole, reasonably consistent with the
behavior or personality trends suggested by [...] “face validity”;
for b[...] [...]ency will be sought only in an effort [...] highest
validity coefficients against [...]



(5) The fifth criterion, employed in constructing inventories to be
used principally in schools, assumes that the obtained judgments
have adequate validity and that the judges are competent to assess
personality traits as well as desirable and undesirable forms of
adjustment. In some instances the assumption is warranted; in many it
is not.

(6) The sixth criterion assumes that items and tests already in use
are themselves valid. This is a practice which in many instances is
not justified and which tends to perpetuate inadequacies, errors, and
misconceptions inherent in the older inventories.

(7) Factor analysis, the seventh criterion, has in the minds and work
of some investigators come to take the place of validation against
behavioral and psychological analysis. These investigators assemble
a number of items, administer the inventory to a standardization
population, statistically analyze the scores, group the items into a
number of categories, and give the categories names of traits that
appear to be measured by means of the items that they decided, more
or less arbitrarily, should go into the inventory in the first place.
This is a form of circular reasoning that does not add to the
construction of valid instruments. 30 Actual behaviors of defined
groups of persons must be the ultimate criteria of validity of
practically all personality inventories. 31 For personality traits
derive their ultimate significance from the role they play in
advancing or retarding people’s personal and social adjustment.

30 A valuable criticism of factor analysis in the study of
personality is to be found in G. W. Allport, Personality, pp. 242 ff.
On the techniques of factor analysis, see J. P. Guilford,
Psychometric Methods, New York: McGraw-Hill Book Co., 1936, Chap. 14.
As an example of an extreme position on factorially analyzed
personality traits, see R. B. Cattell, Personality: a Systematic
Theoretical and Factual Study, New York: McGraw-Hill Book Co., 1950.
Factorial analysis has not yet succeeded in achieving a significant
degree of simplification or integration of personality traits. One
publication brings together comparable analyses [...] with forty-nine
personality factors. See J. W. French, The Description of Personality
in Terms of Rotated Factors, Princeton, N. J.: Educational Testing
Service, 1953. Also, R. B. Cattell et al., “The Objective Measurement
of Dynamic Traits,” Educational and Psychological Measurement, Vol.
10, 1950, pp. 224-248.

31 The exceptions to this statement are the values and attitudes
scales. As already explained, one’s expressed values and attitudes,
as measured on a scale, may be genuine; yet for other, overbalancing
reasons his actions may not be consistent with them.

(8) In using the eighth criterion, an author simply selects or devises
items to suit his own definition of a trait or a theory of personality,
without concern thereafter over their behavioral or statistical validity.
Starting with theories and definitions is, of course, desirable; but the
validation process must go beyond that stage.
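
The third and fourth criteria discussed above lend themselves to a brief computational sketch. The Python below is an illustration under stated assumptions, not a reconstruction of any author's procedure. The function names and the sample answers are hypothetical. The default of the upper and lower 27 per cent for the extreme groups is a common convention assumed here, and the correlation of each item with the sum of the remaining items stands in for the "part scores" mentioned in the text.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    ss_x = sum(d * d for d in dx)
    ss_y = sum(d * d for d in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(ss_x * ss_y)


def discrimination_index(item_answers, total_scores, fraction=0.27):
    """Criterion (3): difference between the proportions of the highest and
    lowest scoring groups who answer the item in the keyed manner (1)."""
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must lie between 0 and 0.5")
    n = len(total_scores)
    k = max(1, int(n * fraction))                      # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group, high_group = order[:k], order[-k:]
    p_high = sum(item_answers[i] for i in high_group) / k
    p_low = sum(item_answers[i] for i in low_group) / k
    return p_high - p_low


def item_rest_correlation(responses, item):
    """Criterion (4): correlate one item with the sum of all other items."""
    item_column = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson_r(item_column, rest_scores)


# Hypothetical answers: four persons, three items keyed 0 or 1.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
totals = [sum(row) for row in responses]
first_item = [row[0] for row in responses]
print(discrimination_index(first_item, totals, fraction=0.5))
print(round(item_rest_correlation(responses, 0), 2))
```

An item that separates the extreme groups sharply may still fail to differentiate persons in the middle range, which is the limitation noted under the third criterion.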

Research Findings. It is not to be assumed that every test of
personality, or many of them, for that matter, have been validated
according to all or even several of these criteria. Most have been
subjected to validation by the method of internal consistency, or
correlation with earlier tests, plus, in some instances, the use of
known groups, in one form or another. Validation data obtained by
correlations of internal consistency have yielded the most impressive
results. But this is readily understandable, because items can be
retained, modified, or eliminated so as to create the desired
internal relationship without, however, giving any assurance that the
specified traits are actually being measured. Results obtained by
intercorrelations of inventories among themselves, though high or
moderate in some instances, have not, on the whole, been very
satisfactory. Poorest results have been found in studies of
validation against competently determined group classifications
(known groups). Yet it is this method which is the most significant
and crucial one. Using this criterion, experimental studies have
yielded contradictory results. 32 For example, of nine investigations
to validate personality questionnaires with groups of
behavior-problem children, the numbers of correlation coefficients of
various sizes were as follows:

two above .70
one between .40 and .70
six below .40

In 75 validating studies, correlating scores of normals and abnormals
(diagnosed neurotics and psychotics) with selected criteria, the
following coefficients were obtained:

thirty-six above .70
nine between .40 and .70
thirty below .40

32 See A. Ellis, “The Validity of Personality Questionnaires,”
Psychological Bulletin, Vol. 43, 1946, pp. 385-440. This is a
comprehensive summary of published studies on the subject and a
valuable criticism of the most common practices. See also, by Ellis,
“Personality Questionnaires,” Review of Educational Research, Vol.
17, No. 1, 1947, Chapter 4.




When inventory scores were validated against ratings by teachers,
friends, or associates, the findings were:

twelve above .70
ten between .40 and .70
twenty-two below .40

Validation studies of four group inventories (Bell, Bernreuter,
Thurstone Personality Schedule, Woodworth Personal Data Sheet)
yielded the following results:

twenty-five above .70
eleven between .40 and .70
forty-four below .40

More consistent and convincing results were obtained when personality
tests, principally the Minnesota Multiphasic, were administered
individually rather than to groups. The validity coefficients were:

ten above .70
three between .40 and .70
two below .40

These last data suggest that individual testing of personality is
superior because subjects may be more highly motivated, due to
clinical rapport; the inventories are more carefully developed; their
uses are more limited and more clearly defined.

Somewhat more than half of the coefficients reported above (.40
and higher) are either quite high or moderate as validating data. And
somewhat fewer than half are quite low (below .40). Although
coefficients below .40 or .50 do not have very much predictive value
for all individuals within the group, they may, nevertheless, indicate
that the inventory has considerable value in identifying individuals
who constitute the more deviant groups.
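
The proportions just stated can be checked against the five tallies given above. The short Python sketch below simply pools those tallies; treating the five sets of studies as non-overlapping is an assumption made for the arithmetic, and the labels and the helper `variance_accounted_for` are hypothetical names introduced for illustration.

```python
# Counts of validity coefficients as reported in the text:
# (above .70, between .40 and .70, below .40)
reported = {
    "behavior-problem children": (2, 1, 6),
    "normals and abnormals": (36, 9, 30),
    "ratings by teachers, friends, associates": (12, 10, 22),
    "four group inventories": (25, 11, 44),
    "individually administered tests": (10, 3, 2),
}

high = sum(row[0] for row in reported.values())
moderate = sum(row[1] for row in reported.values())
low = sum(row[2] for row in reported.values())
total = high + moderate + low

share_at_least_40 = (high + moderate) / total
share_below_40 = low / total


def variance_accounted_for(r):
    """Square of a correlation: the share of criterion variance it accounts for."""
    return r * r


print(total, high + moderate, low)
print(round(share_at_least_40, 3), round(share_below_40, 3))
print(round(variance_accounted_for(0.40), 2))
```

On these figures, 119 of 223 coefficients (about 53 per cent) are .40 or higher and 104 (about 47 per cent) fall below .40, which agrees with the statement in the text. A coefficient of .40 accounts for only 16 per cent of the variance in the criterion, which is one way of seeing why such values have little predictive value for individuals.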

The differences found among the large number of studies summarized
cannot be attributed to the inventories alone. Other factors to be
considered are: the number of subjects, their homogeneity, and their
classification; the soundness of the ratings or of the clinical
diagnoses that are used as validity criteria; the purposes for which
and the conditions under which the inventories were administered.




These findings indicate that inventories for the assessment of
personality traits should not be used indiscriminately or
uncritically; nor should the sounder among them be rejected
uncritically. Personality inventories are more valuable for certain
defined populations than for others; they are more valuable in some
kinds of situations than in others. One comprehensive survey of
published studies between 1946 and 1951 reports “. . . that in most
cases inventory scores discriminate significantly when used with
psychoneurotic, psychosomatic, alcoholic, age, sex, ethnic, and
college groups. . . . and they usually do not give significant group
discriminations when used with vocational, academic, socio-economic,
and disabled and ill groups.” 33

Factors Adversely Affecting Validity Findings. Many reasons have
been offered in explanation of the equivocal and less satisfactory
[...] published studies. The major [...]

The questionnaires sample [...] bring out the whole [...]
organismic, representation of behavior. Personality cannot be [...]
a mere summation of traits. None of the available questionnaires,
rating scales, and personal-history records are able to portray the
personality as a complete, dynamic, organized whole. They measure,
not very precisely, certain aspects of behavior. They do not
actually measure or assess the unit (the organism) that does the
behaving.

Many studies of validity have been devoted to correlations with
[...] diagnoses as criteria; but the psychiatric descriptions and
[...] always clearly defined or sufficiently distinct; [...]

In testing for traits common to a population, attention may be
diverted from the individual as a unit to the assignment of a mere
rank or index to a segment.


33 A. Ellis, “Recent Research with Personality Inventories,” Journal
of Consulting Psychology, Vol. 17, 1953, pp. 45-49.

34 Cf. A. Ellis, Psychological Bulletin, 1946, loc. cit. Also, R. M. W.
Travers, “Rational Hypotheses in the Construction of Tests,”
Educational and Psychological Measurement, Vol. 11, 1951, pp. 128-137;
J. C. Flanagan, “The Use of Comprehensive Rationales in Test
Development,” ibid., pp. 151-155; C. E. Osgood and G. J. Suci, “A
Measure of Relation Determined by Both Mean Difference and Profile
Information,” Psychological Bulletin, Vol. 49, 1952, pp. 251-262.




Some inventories purporting to measure two or more separate 
traits are measuring largely the same trait under more than one name. 

Differences in cultural factors will cause subjects to respond dif- 
ferently to the same question. 

A given question or statement does not have the same meaning for
all subjects, even when clearly stated. It is a fallacy to assume that all
persons have similar reasons for giving similar responses to an item.

Misunderstandings of questions are due to vocabulary limitations 
of some respondents. 

Many questions cannot be answered in the yes, no, ? form. 

There is a general tendency for some subjects to overrate
themselves (“self-halo”).

Almost anyone can falsify his replies to a questionnaire; and an
indeterminate number do so. 35

Some subjects lack insight into their traits; others fundamentally 
and unconsciously may be different personalities from their own 
conscious self-appraisals. 

The scoring of answers to items is often based upon the test au- 
thor’s own personal judgments and set of values. 

On some questionnaires, either very low or very high scores, or 
both, may be significant; but the wide middle range of scores may not 
be at all meaningful for differentiation and description. 

Statistical assumptions and procedures often take the place of
behavior analysis and psychological insights. 36


Positive Factors. On the positive side, the following points may be
made.

Personality testing is a relatively recent practice and is still in
process of development.

Efforts to develop measures of personality traits encourage greater
uniformity in and precision of trait definition and description.

When there is essential agreement in regard to definition of traits
and terms, and in regard to behavior and symptoms, the use of
standardized inventories increases the objectivity of personality
ratings and descriptions.

The use of personality measures encourages analysis of traits into
their constituent elements, thus providing a better understanding of
each trait. (The elements themselves, taken separately and in
isolation, are not, however, the trait.)

35 V. H. Noll, “Simulation by College Students of a Prescribed
Pattern on a Personality Scale,” Educational and Psychological
Measurement, Vol. 11, 1951, pp. 478.

36 Personality tests of several kinds were used extensively in the
armed forces during World War II, seemingly with greater success than
in civilian life. Both spurious and legitimate factors account for
this. See the searching analysis by A. Ellis and H. S. Conrad, “The
Validity of Personality Inventories in Military Practice,”
Psychological Bulletin, Vol. 45, 1948, pp. 385-426.

In some cases when, consciously or unconsciously, persons
misrepresent themselves by their answers on an inventory, the
instrument may still be clinically valuable, because the fact that
they have misrepresented is significant in understanding their
personalities, by means of subsequent interviews.

Psychometric analysis is useful as one of several clinical pro- 
cedures, when its results are considered in conjunction with other 
evidence (e.g., the individual’s history and psychological interview). 

Answers to items of a questionnaire may be employed as the starting
points of subsequent psychological interviews, since answers to
various questions and responses to various statements may be
significant in themselves, or they may reveal significant patterns of
behavior, attitudes, and feelings. In such instances the numerical
scores and percentile ranks can be disregarded. Pragmatically, at the
present time, a useful test of personality is one whose score or
responses to individual items assist in identifying areas of actual
or potential maladjustment for purposes of further, more intensive
study and subsequent treatment. Conversely, they can help in the
identification of areas of wholesome adjustment. At their present
stage of development, this is one of the most useful ways in which
results of personality inventories can be employed.

Personality inventories are useful in the study of group trends; that 
is, in differentiating between groups of adjusted and maladjusted 
rather than between individuals. 

Since personality traits and attitudes may undergo change, it is to
be expected that inventories will be less reliable (in terms of
test-retest scores) than, for example, tests of general intelligence.

Concluding Statement. Testing, [...]ality, is a justifiable
procedure. B[...]able still permit too wide [...]critical group use;
nor are they [...] elaborate statistical processes [...]jected. Tests
of personality should be employed only by professional personnel who
are versed in the principles of their construction, who know their
limitations, and who are capable of insightful psychological analysis
of behavior.

Much more research is needed in this field. In addition to inventory
reliability, the variables of personality tests will have to be of
greater psychological significance. That is, the traits being
measured should be more carefully defined, and every item should be
clearly and significantly related to the chosen trait. The wording of
items should be clearer so that their meanings, so far as possible,
may be uniform for all subjects. Improved, more finely graded means
of answering should be provided in place of the categorical yes, no,
?. The diagnostic value of each item should be established. For this
purpose, statistical analysis is an essential tool.

The criteria against which personality tests are to be validated must
also be made more reliable than they are at present. If, for example,
clinical diagnoses are used as a criterion, they must be accurate and
dependable. Too often this is not the case. In some instances
“tentative diagnoses” instead of final ones have been employed, or,
as not infrequently happens, diagnosticians do not agree among
themselves. 37 Again, as another illustration, it is unsound to use a
blanket classification such as “delinquency” or “problem behavior” as
a criterion, because there are a variety of kinds of delinquents and
problem behaviors, differently motivated and occurring under varying
conditions.

In this chapter, the reader has become acquainted with the range
of traits being tested, the general purposes of personality inventories,
their similarities and differences, and the types of items being used.
But in view of inadequacies of available questionnaires and of their
inherent deficiencies, psychologists have been giving increasing
attention, in recent years, to the study of personality by means of
projective methods, which are presented in the following chapter.


37 For example, P. Ash, “The Reliability of Psychiatric Diagnoses,”
Journal of Abnormal and Social Psychology, Vol. 44, 1949, pp. 272-276.


19. 




PROJECTIVE METHODS: THE 
RORSCHACH AND THE THEMATIC 
APPERCEPTION TESTS 


DEFINITION AND EXPLANATION 


Psychologically, the term projection means: (1) the unconscious
process whereby an individual attributes certain thoughts, attitudes,
wishes, emotions, or characteristics to objects in his environment or
to other persons. (2) Projection also takes the form of attributing
one’s own needs to others in his environment. (3) Or it may take the
form of drawing incorrect inferences from an experience. The process
is not recognized as being of personal origin, with the result that
the content of the process is experienced as an outer perception
[...]

[...]m an opportunity to impose upon it his [...] particular
perceptions and interpretations. [...] projective method (pictures,
ink-blots, in[...], [...] associations, one’s own writings and
drawings, and others) are intended to elicit responses that will
reveal the individual’s “personality structure,” feelings, values,
motives, characteristic modes of adjustment, or “complexes.” He is
said to project the inner aspects of his personality through his
interpretations and creations, thereby involuntarily revealing traits
that are below the surface and incapable of exposure by means of the
questionnaire type of personality test.




Personality inventories are standardized questionnaires which ask 
how the subject feels or acts in a variety of representative situations. 
A projective test, by contrast, is unstructured; instructions are
general and are kept at a minimum to permit variety and flexibility
of responses; there are no suggested responses; the responses are the
subject’s own spontaneous interpretations or creations. (A partial
exception to this is the “projective questionnaire,” which is
partially structured.) Responses to the projective test involve
cognitive factors (those that relate to what is there to the senses)
and affective factors, or feelings about what is there.

The purpose of the projective test is not told to the subject, a
frequent practice being to inform him that it is a test of his
imagination. The most widely used projective methods (e.g., the
Rorschach inkblots, the Murray pictures, and the like) are tests of
perception, 1 which is an individual process. The less clear-cut the
situation, the greater will be the individual differences in
perceiving it. These tests, therefore, provide relatively
unrestricted opportunity for the exercise and expression of
individual differences in perception; for each subject sees what he
himself is disposed to see and does what he is personally disposed to
do. In so doing, through his manner of confronting and responding to
the stimulus situation, the individual indirectly tells the examiner
about himself. 2

By contrast with inventories that attempt to portray the personality
segmentally and imply that personality is the sum of the segments, a
projective technique attempts to view and understand a personality
as a whole, and to interpret its components in their interrelationships.
This viewpoint is generally known as global or organismic theory,
according to which the whole and its parts are mutually related, the
whole being as essential to an understanding of the parts as the parts
are to an understanding of the whole. According to the organismic
view, furthermore, the measurement and evaluation merely of
components does violence to the whole organized structure.


1 Normal perception is defined as awareness of objects, conditions,
and relationships as unified, articulated mental structures.
Perception is also defined as a mental complex or integration that
has sensory experiences as its core. Disturbances of perception will
be shown by lack of integration, distortion, and bizarreness.

2 This is not to say that each person’s perceptions and responses
are completely idiosyncratic. To each stimulus situation there are
certain responses that are quite frequently obtained from certain
groups. But there are also numerous individual [...] and combinations
which give each person’s total response-pattern a degree of
uniqueness.




The organismic conception of personality has emerged from clinical 
studies of numerous individuals whose behavior could be understood 
only in terms of the interrelationships and interdependence of traits, 
and from the many experimental investigations of perception and be- 
havior. Projective methods, therefore, are regarded by many psy- 
chologists as the most valuable tests of personality because they are 


concerned with complex mental processes and portray the whole 
personality. 


THE RORSCHACH TEST 


Description and Procedure. This is the well-known and 
widely used inkblot test, named after Hermann Rorschach, a Swiss, 
who began his experimentation with inkblots as a means of stimulating 
and testing imagination. He was not the first investigator to perceive 
the possibilities of inkblots in experimental psychology, although his 
work was the most extensive of any, continuing from 1911 to 1921. 
He is credited with being the first to develop a technique for their use
in personality diagnosis. 3 He also changed the emphasis from content
analysis to determinant analysis, which is explained in the following 
pages. Rorschach developed his test and methods as a practical tool 
to be applied to clinical cases in the study of unconscious factors in 
perception, and to reveal dynamic factors of behavior and personality. 
He proceeded on the principle that every performance of a person is
an expression of his total personality, the more so if the performance
is concerned with nonconventional stimulus-situations in response to
which one cannot wilfully conceal his individuality. In responding to
inkblots, the subject is generally unaware of what he reveals by the
reports of what he sees. Yet in telling what he perceives, the individual
is giving a general portrayal of his personality.

The Rorschach Test—now used from the nursery-school level 
through all ages of adulthood—consists of ten cards, on each of which 
is one bisymmetrical inkblot. Five are in black and white with dif- 
ferently shaded areas. Two contain black, white, and color in varying 
amounts; three are in various colors (“chromatics”).


The cards, each of which is numbered, are presented to the subject
one at a time and in prescribed sequence. The instructions are very
simple; the subject is asked, according to Rorschach’s own formula:
“What does it look like? What could this be?” Several clinicians and
investigators, who have used the test extensively, have somewhat
modified the original instructions, though not in their essentials.
Klopfer and Kelley, for example, use this formula: “People see all
sorts of things in these inkblots; now tell me what you see, what it
might be for you, what it makes you think of.” 4 The principal
differences in the directions of the several specialists who have
evolved their own formulas are in the amount of encouragement and
urging used to elicit from the subject the fullest possible response
to each card. These modifications, in contrast with rigid directions
to be followed in administering tests of intelligence, specific
aptitudes, etc., might account in part for some of the
inconsistencies found in the numerous published reports on the
Rorschach.

Fig. 19.1. A Rorschach Inkblot with Location of Responses. 1.
Butterfly. 2. Pair of legs. 3. Hands pointing. 4. Profile of a dog.
5. A pointed cap. Inkblot without location, from H. Rorschach’s
Psychodiagnostic plates. (By permission Grune and Stratton.)

3 For a bibliography of early investigations, see J. E. Bell,
Projective Techniques, New York: Longmans, Green, 1948, pp. 75-76.
The great bulk of research on the inkblot technique has been
published since 1930, more especially since 1935.

4 B. Klopfer and D. M. Kelley, The Rorschach Technique, Yonkers,
N. Y.: World Book Co., 1942. See also S. J. Beck, Rorschach’s Test,
Vol. I, Basic Processes, New York: Grune and Stratton, 1944.

Rorschach did not impose time limits; nor do most of the present
users. Nor is there any fixed number of responses for each card. The
examiner makes note of various aspects of the subject’s behavior:
namely, a verbatim record (so far as possible) of the responses; time
elapsed between presentation of each card and the first response to
it (reaction time); length of time in long pauses between responses;
total time required for each card (response time); position in which
card is held for each response (indicating extent of the subject’s
exploration of the stimulus-situation); the subject’s extraneous
movements and other behavior of significance. The three recordings of
time are useful in determining emotional blocking or resistance to
what the individual might be perceiving in a particular inkblot.
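
The examiner's record described above may be pictured as a simple data structure. The Python sketch below is illustrative only. The class `CardRecord`, its field names, and the helper `delayed_cards` are hypothetical, and the rule of flagging a card whose reaction time exceeds twice the median is an arbitrary assumption adopted for the example, not a scoring rule drawn from Rorschach or his followers.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List


@dataclass
class CardRecord:
    """What the examiner notes for one card (hypothetical field names)."""
    card_number: int                                           # 1 through 10
    responses: List[str] = field(default_factory=list)         # verbatim, so far as possible
    reaction_time: float = 0.0                                 # seconds to first response
    response_time: float = 0.0                                 # total seconds spent on the card
    long_pauses: List[float] = field(default_factory=list)     # seconds, between responses
    card_positions: List[str] = field(default_factory=list)    # position held for each response
    behavior_notes: str = ""                                   # extraneous movements, remarks


def delayed_cards(records, factor=2.0):
    """Card numbers whose reaction time exceeds `factor` times the median.

    The threshold is an illustrative assumption, not an established norm.
    """
    typical = median(r.reaction_time for r in records)
    return [r.card_number for r in records if r.reaction_time > factor * typical]


# Hypothetical timings for four cards.
protocol = [
    CardRecord(1, ["butterfly"], reaction_time=5.0, response_time=40.0),
    CardRecord(2, ["two figures"], reaction_time=6.0, response_time=55.0),
    CardRecord(3, ["pair of legs"], reaction_time=5.0, response_time=35.0),
    CardRecord(4, ["a pointed cap"], reaction_time=20.0, response_time=90.0),
]
print(delayed_cards(protocol))
```

A delay singled out in this way is only a cue for the examiner's attention; whether it reflects blocking or resistance remains a matter of clinical judgment.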
Directly after all ten cards have been presented for responses, the
second phase, the inquiry, follows. There are two main purposes of
the inquiry. The first purpose is to learn just what parts or aspects of
the blot were perceived and responded to; that is, what features of
the stimulus-situation initiated and sustained the spontaneous
association process: response to wholes, parts, small details, location,
color, shading, apparent movement, all of which are essential items
of information for scoring purposes. Second, in some instances the
inquiry gives the subject an opportunity to add to or clarify his
original responses; but if this is done, it must be completely
spontaneous on the part of the subject and without any suggestion
whatever from the examiner. The only questions that should be asked are
those needed to clarify the scoring and to include in the score all
the significant aspects of the responses. Too many or too leading
questions by the examiner may elicit from the subject responses which
do not represent his own perceptions but spring rather from
suggestions more or less implicit in the questions asked. The only specific
questions asked are for the purpose of determining the exact area of
location to which particular responses refer, this being necessary for
scoring and analysis under the several categories of the test.


Scoring. The scoring, following Rorschach for the most part, is
based upon four major categories.

Location. The first is the location, or area, which has been
perceived as the basis of each response. This may be the entire
inkblot, a large portion, a small portion, a minute detail, or part
of the white background. The area may be well defined, or merely
vague and blurred. Location of responses is the basis of obtaining
scores for wholes (called W’s), large usual details (called D’s),
small usual details (called d’s), unusual detail (called Dd), and the
white spaces (called S), which are parts of each person’s pattern of
response to the entire test. There are also other symbols used to
designate other aspects of location; but these five are the major
categories.

The locations of responses and the subject's ability to delineate 
them are regarded as indicative of his perceptual organizing processes, 
of his ability to analyze and articulate the parts, and of his associations 
as his perceptions shift within each blot. Analysis of responses in
respect to location is said to reveal the extent of the subject's
perceptual organization or disorganization, measured in terms of
agreement with norms of perception and ability to analyze the whole
and synthesize the parts.


Determinants. The second category includes the determinants, or
characteristics, of the inkblot as perceived by the subject. The
determinants are those aspects or qualities of the blot that have
produced the responses to it. These may be form, shading, color,
perspective, or motion, or combinations of these. Forms may be
perceived with ordinary accuracy (F); or they may be unusual and clear
percepts (F+); or poor percepts (F-). Generally, evaluation of form is
a matter of the examiner's judgment, although some investigators have
provided normative descriptions and numerical scores.⁶

The frequency, intensity, and interpretation of shading noted by the
subject are recorded. The manner in which the subject responds to the
shading (K) of the blots is said to be relevant to the manner in which
he meets and satisfies his own affectional needs: whether by conscious
denial of affectional need, or by a repressive mechanism, or by
insensitivity and undeveloped affectional relationships with other
persons.
With regard to color (C), the examiner records the particular colors
reported and the manner in which the subject combines color with form,
there being three categories: responses to pure color without form
being involved (recorded as C); responses to a combination of form and
color in which form is dominant (FC); and responses to a combination
of color and form in which color is dominant (CF).

A score for movement is given by most examiners when the subject
perceives something going on in the blot, whereas Rorschach restricted
the movement score (M) to responses which involve empathy; that is, a
true experiencing of or identification with the movement reported
(obviously an extremely difficult phenomenon for the examiner to
discern). At present a common practice is to reserve the symbol M for
human movement, to designate animal movement as FM and inanimate
movement as m.

6 See B. Klopfer and H. H. Davidson, “Form Level Rating: A Preliminary
Proposal for Appraising Mode and Level of Thinking as Expressed in
Rorschach Records,” Rorschach Research Exchange, Vol. 8, 1944, pp.
164-177. Also, Rapaport, op. cit., pp. [...]7 f.; B. Klopfer et al.,
Developments in the Rorschach Technique, New York: World Book Co.,
1954, Chap. 8.

The subject's mention or use of perspective or depth (FK) is also
noted and scored. Parts of the inkblot are perceived as having
perspective and being three-dimensional. FK, “in reasonable numbers,”
is said to be related to good adjustment, through attempts to handle
affectional anxieties by introspective efforts.

The foregoing are the principal determinants as developed by 
Rorschach, and as further developed and modified by Klopfer and 
his collaborators. Somewhat different sets of determinants have been 
developed from Rorschach's original ones by several other
psychologists, notably S. J. Beck. Although there are some differences
in symbols used and in details of response classification, the
conceptual similarities between the several systems are much greater
than the dissimilarities. One of Beck's categories, in particular,
should be mentioned in connection with response determinants; that is
what he calls organization (designated by Z). This determinant, an
extension of the concept of the whole (W), [...] create) “new and
meaningful relationships [...] not usually so organized [...] of
intelligence. [...] be useful in detecting [...] anxiety, or
unresponsiveness in [...] intelligence.
Content. The third category is content. Here the subject's responses
are classified into several more common groups such as plants,
animals, humans, landscapes, man-made objects, anatomy, sex, and
others. Content items are not merely classified into specified groups;
they are used by the examiner as a source of ascertaining the
subject's personal meanings, attitudes, interests, and even
“complexes.” Some examiners have interpreted content items, also, as
having psychiatric or psychoanalytic meanings. For example, in some
contexts the response “eyes looking at me” (in some of the cards) is
given the obvious interpretation of “paranoid reaction.” “Puppets” or
“marionettes” perceived in a card are interpreted at times to suggest
schizoid tendency, as a feeling of being influenced and directed by
hostile persons.

7 Beck, op. cit.

Originals and Populars. The fourth scoring category is originality,
also known as “popularity-originality.” This has to do with the rating
of a response as one that is commonly given (popular) or as one that
is uncommon (original). Investigators and interpreters of Rorschach
test responses are not entirely agreed as to which responses shall be
scored popular and which original, although there are, of course, many
responses about which there is no doubt. However, if users of the
Rorschach test are to achieve a satisfactory and essential degree of
uniformity in regard to the significance of their results, this
problem of popularity-originality will have to be resolved
statistically, in a manner similar to that used in other types of
tests. Some normative studies have been made in this area.⁸

Interpretation. Scoring of the responses according to the foregoing
categories, or according to any one of the modifications and
elaborations thereof,⁹ is not an end in itself. The major purpose of
the test is to assess the subject's general adjustment and to learn
whether he is experiencing psychological difficulties; in short, to
get insights into his personality, such as could not be obtained
ordinarily by direct questioning.

Although considerable experience under supervision is necessary to
learn the techniques of administering and scoring the Rorschach, much
more experience and expertness are required for the interpretation of
scores as an organized, meaningful whole. With the Rorschach, the two
aspects, a scoring system and skill in (perhaps the art of)
psychological interpretation, are essential. The particular items in
the responses of the subject are not in themselves most important; the
inferences drawn from them are.


8 For example, M. R. Hertz, Frequency Tables to be Used in Scoring the
Rorschach Ink-Blot Test, Department of Psychology, Western Reserve
University, 1946; W. N. Thetford et al., “Developmental Aspects of
Personality Structure in Normal Children,” Journal of Projective
Techniques, Vol. 15, 1951, pp. 58-78; R. Carlson, “A Normative Study
of Rorschach Responses of Eight-Year-Old Children,” ibid., Vol. 16,
1952, pp. 56-65.

9 For example, R. L. Munroe, “Prediction of the Adjustment and
Academic Performance of College Students by a Modification of the
Rorschach Method,” Applied Psychology Monographs, No. [...], 1945.




A simple example will show how a response is scored for the sev- 
eral categories. The subject states that the inkblot on the first card 
looks like a bat. This response is interpreted as showing: that the 
whole blot (W) was perceived (location, or area); that the percep- 
tion was based upon form, and that the blot is sufficiently similar 
to a bat in shape so that the response is acceptable (F); that the 
response is classified under “animal” (content, A); and that the 
response is commonly given, or popular (P). Hence, using symbols, 
this response would be recorded as W, F, A, P. 
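
The recording convention in this example can be sketched in code. The following is only a minimal illustration under stated assumptions: the class name ScoredResponse, its field names, and the tally helper are hypothetical and belong to no published scoring system; the symbols themselves are those used in the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredResponse:
    """One scored response (hypothetical record layout)."""
    card: int            # card number, 1 through 10
    text: str            # what the subject reported
    location: str        # e.g. "W", "D", "d", "Dd", "S"
    determinant: str     # e.g. "F", "M", "FM", "FC", "CF", "C"
    content: str         # e.g. "A" for animal
    popular: bool = False

    def symbols(self) -> str:
        """Return the response in the symbolic form used in the text."""
        parts = [self.location, self.determinant, self.content]
        if self.popular:
            parts.append("P")
        return ", ".join(parts)


def tally(responses):
    """Count how often each symbol occurs within each scoring category."""
    counts = {
        "location": Counter(),
        "determinant": Counter(),
        "content": Counter(),
        "popular": Counter(),
    }
    for r in responses:
        counts["location"][r.location] += 1
        counts["determinant"][r.determinant] += 1
        counts["content"][r.content] += 1
        if r.popular:
            counts["popular"]["P"] += 1
    return counts


# The example from the text: the first card "looks like a bat".
bat = ScoredResponse(card=1, text="a bat", location="W",
                     determinant="F", content="A", popular=True)
print(bat.symbols())  # W, F, A, P
```

Keeping each category in its own field makes the later tabulation a matter of counting, which is all that the tally helper does.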

Interrelationships. Having scored each response, tabulated the 
results, and drawn a psychogram, the next and very significant step 
is to analyze the relationships existing between the frequencies in 
several categories, since the frequencies in isolation are not all-impor- 
tant. For example, the following are among the relationships investi- 
gated: the proportion of responses in each of the scoring categories, 
this being of great importance; the ratio of wholes to larger details 
and to minute details; extent to which form is used with other de- 
terminants, such as color and shading, form being dominant; extent 
to which color is used with other determinants, color being domi- 
nant; relationship between color and movement responses; frequency 
with which color is named apart from other determinants; the ratio 
of human-movement responses to animal-movement responses; ratio 
of pure movement responses apart from other determinants to movement
plus other elements, especially form; percent of original responses.
These and other comparisons and interrelationships show the
organization and patterned characteristics of the individual's
personality, as measured by the Rorschach. [...] indicative of strong
uncontrolled [...] relatively serious; in another record, if there are
balancing factors, they may be indicative of satisfactory adjustment.
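
Several of the relationships listed above are simple proportions and ratios, and can be sketched as follows. This is a hedged illustration, not a scoring procedure: the function names, the dictionary keys, and the twenty-response record are invented for the example, and no interpretive cutoffs are implied.

```python
from collections import Counter


def percent(part, total):
    """Return part as a percentage of total (0.0 when total is zero)."""
    return 100.0 * part / total if total else 0.0


def interrelationships(locations, determinants, n_original):
    """Summarize a few of the relationships named in the text.

    locations and determinants hold one symbol per response;
    n_original is the number of responses scored as original.
    """
    if len(locations) != len(determinants):
        raise ValueError("one location and one determinant per response")
    total = len(locations)
    loc = Counter(locations)
    det = Counter(determinants)
    return {
        "W%": percent(loc["W"], total),
        "D%": percent(loc["D"], total),
        "F%": percent(det["F"], total),
        "pure C%": percent(det["C"], total),
        "O%": percent(n_original, total),
        "W:D": (loc["W"], loc["D"]),
        "M:FM": (det["M"], det["FM"]),
        "FC:CF": (det["FC"], det["CF"]),
    }


# Hypothetical record of 20 responses (invented for illustration only).
locations = ["W"] * 6 + ["D"] * 10 + ["d"] * 3 + ["S"]
determinants = (["F"] * 9 + ["M"] * 3 + ["FM"] * 4
                + ["FC"] * 2 + ["CF"] + ["C"])
summary = interrelationships(locations, determinants, n_original=2)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The ratios are kept as pairs of counts rather than reduced to a single quotient, since the text stresses that such figures are read in relation to one another and not in isolation.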


[...] great detail, a few illustrations will [...] see how responses
are interpreted. At the risk of tedious repetition, it should be
stated again that no single sign is the basis of a diagnosis.

Location. The locations of responses are used mainly in evaluating
intellectual aspects of personality: the manner of approach to a
perceptual problem; the preferred mode of apperception. A predominance
or large percentage of whole responses (W), that is, emphasis upon the
whole rather than the particular, is characteristic of persons of
higher levels of capacity for intellectual organization and
abstraction. Not only the number of wholes is important here; the
originality and appropriateness of the responses must be considered;
for simple, popular wholes indicate superficiality and commonplace
thinking.

Predominance of common responses of detail (D) is regarded as 
evidence of concrete, unoriginal, practical mental processes. Re- 
sponses of unusual detail (Dd) indicate perception of the unusual, 
associated at times with precise and critical mentalities. But carried
to an extreme, rare detail responses may indicate an obsessive
preoccupation with the trivial, often accompanying states of anxiety.
Some clinicians have claimed more subtle meanings in location
responses; as, for example, that a marked concern with edge details
signifies the presence of an escape mechanism.

Rigidity of approach to problems is said to be indicated by a subject
who uses the same procedures with all cards in making his reports,
beginning with wholes, then proceeding to larger details, minute
details, and white spaces. The reverse of this order is found less
frequently. Rigidity is distinguished from orderliness of approach,
the latter being indicated when a uniform procedure is followed in
most location responses, but when, also, variations are used on
occasion. Psychotic and some emotionally unstable persons are confused
and chaotic in their approach, no order or plan being observable.
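
The distinction drawn above between rigid, orderly, and confused approaches can be made concrete with a rough sketch. Everything numerical here is an assumption made for illustration: the text gives no cutoffs, so the 0.7 and 0.4 thresholds, the function names, and the placing of Dd between d and S in the expected order are hypothetical, and the "loose" label is borrowed from the four-point succession scale mentioned later in the chapter.

```python
# Expected order of approach: wholes, larger details, minute details,
# white spaces. The position given to Dd is an assumption.
RANK = {"W": 0, "D": 1, "d": 2, "Dd": 3, "S": 4}


def follows_order(locations):
    """True when the locations on one card never step back in the order."""
    ranks = [RANK[loc] for loc in locations]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def classify_succession(cards, orderly_cutoff=0.7, loose_cutoff=0.4):
    """Label a record from the share of cards that follow the order.

    cards is a list with one list of location symbols per card.
    Cards with fewer than two responses show no sequence and are skipped.
    The cutoffs are illustrative defaults, not published norms.
    """
    usable = [c for c in cards if len(c) >= 2]
    if not usable:
        return "undetermined"
    share = sum(follows_order(c) for c in usable) / len(usable)
    if share == 1.0:
        return "rigid"
    if share >= orderly_cutoff:
        return "orderly"
    if share >= loose_cutoff:
        return "loose"
    return "confused"


# Hypothetical record: three of four cards follow the order.
record = [["W", "D", "d"], ["W", "D"], ["D", "W"], ["W", "D", "S"]]
print(classify_succession(record))  # orderly
```

A sketch of this kind captures only the sequence of locations; it says nothing about the quality of the responses, which the text treats as equally important.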
Form. If a person's form perception is clear and accurate (F+ or F),
he is said to have firm control over his intellectual processes and
behavior. By contrast, the schizophrenic, whose behavior is
disorganized and whose perceptions are distorted, often reports
inappropriate and bizarre forms for the blots (F-). A high percentage
of form scores independent of other determinants is regarded as
evidence of restricted emotional and social adjustment (repression or
suppression). Form scores combined with other determinants indicate
the degree of intellectual activity.
Color. The subject's responses to color in the blots provide the most
direct evidence of the subject's impulsive life and emotional
relationships to his environment, the combination of color and form
and extreme accent on color alone being regarded as the most
significant signs. Color usually arouses some degree of feeling or
emotion in the respondent. The subject may entirely suppress his
[...]; he may spontaneously respond solely to the color without
associating it with a form or object; he may combine color and form,
one or the other being dominant. The degree to which color is the sole
or dominant determinant in a subject's responses is viewed as
indicative of his emotional intensity. The extent and manner in which
he combines color with form is evidence of the degree of control over
emotional impulse. Predominance of form-color responses (FC),
perceived as meaningful wholes, implies optimal emotional control,
accompanied by a capacity for social adaptability. Color-form
responses (CF), in which form is secondary, when dominant suggest a
somewhat impulsive, egocentric personality. A large proportion of pure
color responses (C) is indicative of emotional impulsiveness.

An additional kind of color response, “color shock,” has been
observed. This happens when the respondent is thrown off balance by
the presentation of a colored card, especially when it follows a
black-and-white blot. Shock is shown by a delay in reaction time, by
exclamations, by peculiar responses, or by inability to respond to
color. Color shock is believed to imply anxiety neurosis; the person's
ability to respond is seriously impaired by loss of emotional
equilibrium due to affect produced by color. Color shock has not only
been found frequently with neurotic persons; it seems to be typical of
persons more seriously disturbed mentally.

The concept of “color shock” and actual appearance of the phenomenon,
however, have been questioned and subjected to experimental study.
Most of the experimental data do not support the original concept.¹⁰
But critics of these statistical studies maintain that the concept of
“color shock” and interpretation of color responses have been handled
in a mechanical manner, without due regard to the individualized way
in which color is handled in the record. They hold that how the
subject responds to color, if at all, is more significant than
statistical counts, in judging the impact of color stimuli upon
emotional responsiveness. These critics of the usual statistical
analyses point out that the following aspects of response must be
evaluated in clinical interpretations of color responses: color
selection, color shyness, color denial, color avoidance, and color
disregard with objective disturbance.¹¹

10 See, for example, J. A. Perlman, “Color and the Validity of the
Rorschach 8-9-10 Percent,” Journal of Consulting Psychology, Vol. 15,
1951, pp. 122-126; B. T. Meyer, “An Investigation of Color Shock in
the Rorschach Test,” Journal of Clinical Psychology, Vol. 7, 1951, pp.
367-370.

The disagreement noted regarding this area of Rorschach interpre- 
tation exemplifies, in general, a principal basis of difficulty encoun- 
tered in efforts to objectify interpretations of responses and to validate 
the instrument as a whole: namely, the conflict between those who 
would break down the responses into elements for purposes of statis- 
tical analysis, and those who maintain that by so doing the significance 
of each element is destroyed, since each must be viewed in its relation- 
ships to the whole. This latter group maintain, also, that available 
statistical methods are inadequate for handling this problem of
primary importance.

Shading. Responses that use shading are said to be related to the
manner in which the subject meets his affectional needs. These
responses are interpreted as being related to anxiety, depressed
attitudes, and feelings of inadequacy.

Movement. The movement score (especially human movement), according to
Rorschach, is evidence of the richness of one's associative life: the
higher the score, the richer the associations and the richer the
imaginative life. A large number of reports of human movement,
combined with responses to color but not outweighed by them, indicates
superior creativeness and giftedness. A high frequency of human
movement with little or no color response is said to be characteristic
of persons having a rich inner life but little affective response to
the outside world. Such persons were named introversive by Rorschach.
On the other hand, much color response, accompanied by few reports of
human movement, indicates the extraversive personality.

Beck, one of the productive students of the Rorschach, reports that
the significance of movement responses differs with various
personality organizations.¹² In emotionally stable adults and in some
neurotics, it is as stated above. In cases of schizophrenia, movement
response is indicative of a highly subjective and personal experience.
In adults having adjustment problems but without psychosis, it
represents fantasy living; in the manic, it indicates egocentric wish
fulfillment.

Klopfer, who has published extensively on the Rorschach, makes a
distinction between reports of human movement and those of animal
movement.¹³ In respect to the former, he agrees essentially with
Rorschach's interpretation. But a large proportion of the latter,
according to Klopfer, indicates that the person is functioning on a
“level of instinctive prompting” rather than at the level of creative
activity. Hertz and Kennedy have added still another interpretation to
movement responses.¹⁴ They report that a high movement score combined
with satisfactory form, originality, detail, and organization
indicates superior intellectual ability.

11 B. Klopfer et al., op. cit., pp. 338 ff.

12 S. J. Beck, “Psychological Processes in Rorschach Findings,”
Journal of Abnormal and Social Psychology, Vol. 31, 1937, pp. 482-488.
Content. The number, proportions, and kinds of things represented in
the responses have been variously interpreted as having psychoanalytic
significance (fantasy life, symbolic meaning); as showing the amount
of stereotyped thinking in the subject; as a sign of maturity or
immaturity; as revealing feelings of inferiority; as reflecting the
subject's interests, obsessions, and compulsions. Much of this content
interpretation is tentative and needs experimental confirmation.

Originals and Populars. The percentages and number of original and
popular responses, as might be expected, are taken as evidence of the
subject's level of intelligence, though the quality of the originals
must be evaluated; for these might be no more than bizarre items or
distortions of perception.


Interrelationships. [...] Insecurity and anxiety: “types,” etc.¹⁶

13 B. Klopfer and D. M. Kelley, op. cit.

14 M. R. Hertz and S. Kennedy, “The M Factor in Estimating
Intelligence,” Rorschach Research Exchange, Vol. 4, 1940, pp. 105-106.

15 Some of the recent experimental work on movement has been concerned
with its relation to the perceptual process. See R. Arnheim,
“Perceptual and Aesthetic Aspects of the Movement Response,” Journal
of Personality, Vol. 19, 1951, pp. 265-281; G. S. Klein and H. J.
Schlesinger, “Perceptual Attitudes toward Instability: Prediction of
Apparent Movement Experiences from Rorschach Responses,” ibid., pp.
289-302.

16 For several of these, see Bell, op. cit., Chapter 16.


[...] determining the nature of a given personality and that [...] of
the clinician is to interrelate the indicators in such a way as to
yield a meaningful whole. It is obvious, also, that scoring [...]
among the experts are as yet not uniform, being in a state of
development. It should be apparent, too, that there is a high degree
of subjectivity in both the scoring and interpretation of responses,
and necessarily so in dealing with an unstructured test. This means,
of course, that skill in the use of such instruments and the
attainment of maximum validity can be achieved only after carefully
supervised practice and experience.

[Figure: profile chart giving the number of responses (to be filled in
by the examiner) in each scoring category; legible group labels
include Differentiated Shading, Texture and Achromatic Color, Bright
Color, and Diffusion, Vista.]

FIG. 19.2. Rorschach response profile. (From the Individual Record
Blank, by B. Klopfer and H. H. Davidson, New York: World Book Co. By
permission.)

Figure 19.2 is a profile form devised by Bruno Klopfer and Helen H.
Davidson.¹⁷ As reproduced here, it shows the number of responses of an
actual case in each category, to illustrate the manner in which
results are portrayed. Table 50, also by Klopfer and Davidson, is
shown so that the student may see more clearly the extent to which the
various scores are interrelated and comprehend more clearly what is
meant by interpretation of Rorschach responses as a whole. The
information in this figure and the table provides the starting
materials for interpretation.

17 Published by World Book Co. (By permission.)

Personality “Structure.” The Rorschach test is a “multidimensional
instrument” that is intended to yield information regarding the
“structure” of the subject's personality. Three major “dimensions” are
appraised: namely, conscious intellectual activity, externalized
emotions, and internalized emotions. Each of these is “measured” in
terms of the several scoring categories already explained. By
“structure” is meant the manner in which the various personality
aspects or traits are interrelated so as to produce each of the three
major “dimensions”


RELATIONSHIPS AMONG FACTORS

Total Responses (R) =
Total Time (T) =
Average time per response =
Average reaction time for Cards I, IV, V, VI, VII =
Average reaction time for Cards II, III, VIII, IX, X =
Total F / R =
(FK + F + Fc) / R =
(A + Ad) / R =
Number of P =
Number of O =
(H + A) : (Hd + Ad) =
M : sum C =
(FM + m) : (Fc + c + C) =
No. of responses to Cards VIII, IX, X / R =
W : M =
Succession: Rigid, Orderly, Loose, Confused (place a check mark at the
appropriate point on the scale)

Estimate of Intellectual Level (Intellectual Capacity and Intellectual
Efficiency): Very Superior, [...], Dull Normal, Feebleminded. Note
that this estimate is based mainly on the following: number and
quality of W; number and quality of M; level of form accuracy; number
and quality of O; variety of content; succession.

Manner of Approach: Enter the location percentages in the spaces
provided. Compare these percentages with the norms shown on the blank,
by placing a check mark opposite the appropriate range of percentages.

TABLE 50. Outline for analyzing Rorschach responses.¹⁸


which, themselves, are [...] whole personality. [...] experiences,
What are his characteristic [...] behaviors that result from the
organization [...]. In regard to “structure,” some of the questions
[...] pattern of responses are intended to answer are: Is the
subject's mentality original or stereotyped? Are his abilities
creative or reproductive? Is he less adaptable or more adaptable to
reality? Is the “inner” or the “outward” life stronger? Or, more
specifically, how and to what degree does the individual control his
emotions and feelings? Is he prone to anxiety?

18 From the Individual Record Blank, by B. Klopfer and H. H. Davidson,
New York: World Book Co. (By permission.)
Taking one of the three major “dimensions” (intellectual activity) as
an example, it would be appraised by the scores on the following
aspects of the responses. Each aspect is accompanied by the factors
regarded as significant elements in interpretation of mental level.

Perception of form: quality and percentage of total responses;
clearness vs. vagueness; accuracy vs. inaccuracy (F, F+, or F-).

Perception of wholes: number and percentage of total responses;
ability to integrate; ability in abstract and theoretical activity.

Perception of major and minor details: number and percentage of total
responses; concrete or practical intelligence.

Organization: ability to perceive and create new wholes; creative
ability.

Original and popular responses: qualitative richness of response vs.
commonplace, stereotyped thinking.

Animal content: “sterility” or immaturity of thought processes (an
excess of animal responses) vs. rich and insightful thought processes
(perception of forms of a variety of categories).

Productivity: total number of responses; intellectual energy.

Sequence: rigid, or orderly and flexible, or loose, or confused
approach to a problem, indicating level of intellectual control.

Movement (mainly human movement): inventive and creative abilities vs.
stereotyped thinking.


Reliability. As with all other types of psychological tests,
reliability and validity of the Rorschach need to be demonstrated if
this instrument is to attain maximum usefulness. It is undoubtedly
true that widely experienced clinicians using this test are able to
draw some remarkably acute inferences from Rorschach protocols; but
this is a highly subjective procedure. Ideally, it is desirable to
demonstrate objectively the Rorschach's reliability and validity; and
considerable research effort, both clinical and experimental, is being
devoted to that end.

There are, however, several problems and principles that must be kept
in mind in evaluating reliability and validity studies on the
Rorschach. First, interpretation of responses is “global”; that is,
the significance of responses is determined by the total pattern,
whereas analysis into parts and part-scores for the purpose of
statistical treatment does violence to the principle underlying the
test and to the meanings of the responses. As yet there are no
statistical methods for dealing with integrated, “global,” patterns of
responses. The several
methods thus far suggested are quite inadequate for handling this 
complex problem. Second, rating or scoring of responses introduces 
a greater subjective element (than in the case of intelligence tests) 
because there are no right or wrong answers, the variety of responses 
is considerable, and normative data are meagre. As criteria of validity, 
differential clinical diagnoses are not sufficiently uniform or reliable, 
and descriptions of behavior are non-quantitative and subjective to 
an appreciable degree. 

The foregoing points emphasize the unusual difficulties encountered 
in attempts to validate and objectify projective techniques in the 
same manner as other types of instruments have been. Although the 
problems of validity and reliability have not been solved, much work 
has been done. We shall describe briefly the principal methods em- 
ployed, together with the present status of the findings. 

Studies devised to show degree of Rorschach reliability may be 
classified under five types, as shown below. Each approach will be 
briefly explained and evaluated. 


1. Parallel series of cards

2. Split-half correlations

3. Test-retest correlations

4. Matching interpretations of Rorschach records

5. Attempts on the part of subjects to falsify responses and
   misrepresent themselves


Parallel Series. The use of two parallel series of cards is intended
to be analogous to the use of two equivalent forms of a test of
intelligence or of specific aptitude. In this country a set of
parallel cards has been devised by Harrower and Steiner.¹⁹ Data in the
manual show that this newer set of cards elicits much the same types
of responses as do the Rorschach cards, and in similar proportions.
Unfortunately, however, no systematic statistical analyses of scores
are presented in the manual; nor has there been sufficient evidence
from other sources to warrant any but the most general conclusions
regarding the degree of equivalence of the two sets of cards. The
original Rorschach and the Harrower-Steiner have yielded sufficiently
similar results for the same subjects as to warrant the use of the
latter as a supplemental instrument.

It should be pointed out that devising two equivalent or comparable
sets of inkblots is a much more difficult task than devising two sets
of equivalent questions or problems; for unless two inkblots are
identical, their stimulus values will be different in some degree. For
this reason, the newer sets of cards are appropriately called
“parallel” rather than equivalent. The more precise evaluation of the
parallel forms, and reliability studies based on them, must await
detailed quantitative data and intercomparisons.

19 M. R. Harrower and M. E. Steiner, Psychodiagnostic Inkblots, New
York: Grune and Stratton, 1945.

Split-half. When the split-half method is used to estimate
reliability, responses to odd-numbered cards are correlated with those
given to even-numbered cards. That is, the score in each of the
categories obtained on odd-numbered cards is correlated with the
corresponding score on the even-numbered cards (e.g., number of W's,
F's, D's, etc.). This method has yielded reliability coefficients
ranging from low to high: from approximately .60 to .90.
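
The odd-even procedure just described amounts to a product-moment correlation between two half-test scores. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the per-card counts of W responses for five subjects are invented, and no correction for test length is applied because the text mentions none.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary in both halves")
    return sxy / sqrt(sxx * syy)


def split_half_r(counts_by_subject):
    """Correlate odd-card totals with even-card totals.

    counts_by_subject holds, for each subject, ten counts of one scoring
    category (for example W), in card order I through X.
    """
    odd = [sum(c[0::2]) for c in counts_by_subject]   # Cards I, III, V, VII, IX
    even = [sum(c[1::2]) for c in counts_by_subject]  # Cards II, IV, VI, VIII, X
    return pearson_r(odd, even)


# Invented W counts per card for five hypothetical subjects.
w_counts = [
    [1, 0, 1, 1, 1, 0, 1, 0, 0, 1],
    [2, 1, 1, 2, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 0],
    [3, 1, 2, 2, 1, 2, 1, 1, 1, 1],
]
print(round(split_half_r(w_counts), 2))
```

The same calculation would be repeated for each scoring category in turn (W, D, F, and so on), giving one coefficient per category.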

For several reasons, the split-half method has been virtually
abandoned in Rorschach studies. First, there is the usual argument
that the significance of a Rorschach protocol is to be found in the
whole, integrated set of responses: therefore, working with isolated
variables is invalid. This argument does not have much merit, since
isolation of a variable for statistical analysis is one matter, while
isolation for interpretation is quite another. The second and more
pertinent point is that there are marked variations in the stimulus
value of each of the ten cards: hence, since they are not similar in
functions or behaviors to be elicited, the odd-even method should not
be employed. A third point, one that applies to all quantitative
analyses of the Rorschach, is that scoring of responses introduces a
strong subjective element, due to lack of normative data and to the
exceedingly large number and variety of responses possible.²⁰

Test-retest. The test-retest method also has yielded coefficients from
low to high, the maximum being about .90. The magnitude of the
coefficient has varied roughly with the length of interval between
tests.²¹ Swift, for example, retested preschool children at several
different intervals, with the following results:²²

20 P. E. Vernon, “The Rorschach Inkblot Test,” British Journal of
Medical Psychology, Vol. 13, 1933, pp. 89-118, [...], 271-295; [...]
J. P. Guilford, “The Reliability and Meaning of Erlebnistypus Scores
in the Rorschach Test,” Journal of Abnormal and Social Psychology,
Vol. 31, 1936, pp. [...]; M. R. Hertz, “The Reliability of the
Rorschach Inkblot Test,” Journal of Applied Psychology, Vol. 18, 1934,
pp. [...]; M. R. Hertz, “Rorschach: Twenty Years After,” Psychological
Bulletin, Vol. 39, [...], pp. 529-[...].


After 14 days, coefficients varied from .59 to .83 for the several
scoring categories.

After a 10-month interval, coefficients varied from .18 to .53 for the
several scoring categories.

The findings for the 14-day interval are much more significant as
evidence of reliability (internal consistency) of a projective test,
especially with very young children who are developing rapidly and in
whose lives an interval of ten months is a relatively large and
significant period.

Another reliability study, using twenty chronic schizophrenics,
yielded quite favorable results.²³ Reliability coefficients were:

Location categories:  .50 (D) to .95 (W)
Determinants:         .16 (FC) to .96 (k)
Content:              .36 (Ab) to .94 (Hd)
Relative scores:      -.17 (8-9-10%) to .94 (W%, d%, F+%)
On the whole, with a few exce 
moderate to quite high. Low cor 
were due to infrequencies of responses (low N) and narrow ranges 
of N within the categories concerned, The data Suggest, also, that, 
with one exception, there were no significant differences between 
means. An additional reliability test was appli 


i ed: test and retest tab- 
ulation sheets were matched by two experienced Rorschach exam- 


ptions, the coefficients ranged from 
relation coefficients, the authors state, 


21 Cf. Z. Piotrowski, “The Reliability of Rorschach’s is 5,” r 
of Abnormal and Social Psychology, Vol. 32, 1937 nebnistypus,” Junia 
Brosin and E. O. Fromm, “Some Principles of Gestalt Psychology in the 
Rorschach Experiment,” Rorschach Research Exchange, Vol. 6, 1942 pp- 
1-15; A. I. Rabin and M. H. Sanderson, “An Experimental Inquiry into Some 
Rorschach Procedures,” Journal of Clinical Psychology, Vol. 3, 1947 pp- 
216-225; I. A. Fosberg, “An Experimental Stu Reliability of the 


idy of the 
Rorschach Psychodiagnostic Technique,” Rorschach Research Exchange, Vol. 
5, 1941, pp. 72-84. 

* J. W. Swift, “Reliability of Rorschach Scoring Categories with Preschool 
Children,” Child Development, Vol. 15, 1944, pp. 207-216. 


*8 J. D. Holzberg and M. Wexler, “Predictability of Schizophrenic Per- 
formance on the Rorschach Test,” Journal of Consulting Psychology, Vol. 14, 
1950, pp. 395-399. 


The Rorschach Test 521 


iners. They were correct in 85 percent of their selections, this being 
Significant at the one-percent level. 

The test-retest method should be used with caution in connection 
with projective methods, for these reasons: There may be some carry-
over resulting from mere recall of previous responses. Since the scor-
ing of determinants is dependent, in part, upon the subjects’ intro- 
spection during the inquiry, another possibility of inconsistency is 
introduced in retesting. Is the subject always able to state exactly 
which determinants (elements) were primary or involved in his per- 
ceptions? If the interval between examinations is a significant one, 
some changes in behavior may have taken place due to maturation, 
therapy, or environmental factors. 
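
As an illustration of how a retest coefficient of this kind may be obtained, the following minimal sketch correlates the scores of the same subjects on two testings. It assumes that the reported coefficients are ordinary product-moment correlations, which the studies cited are not stated here to have used; the function name `pearson_r` and the six pairs of scores are hypothetical.

```python
from math import sqrt


def pearson_r(first, second):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_first = sum(first) / len(first)
    mean_second = sum(second) / len(second)
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    ss_first = sum((a - mean_first) ** 2 for a in first)
    ss_second = sum((b - mean_second) ** 2 for b in second)
    if ss_first == 0 or ss_second == 0:
        # No spread in one set of scores: the coefficient is undefined.
        raise ValueError("correlation is undefined when one list has no variance")
    return cross / sqrt(ss_first * ss_second)


# Hypothetical numbers of whole (W) responses given by six subjects
# on a first testing and on a retest (invented for illustration only).
first_testing = [4, 7, 2, 9, 5, 6]
second_testing = [5, 6, 3, 9, 4, 7]
retest_r = pearson_r(first_testing, second_testing)
print(round(retest_r, 2))
```

A scoring category in which nearly all subjects give the same number of responses has little or no variance, so that its coefficient is unstable or undefined; this is one reason why infrequent categories tend to show low retest coefficients.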

It appears that Rorschach results are more stable for some classes 
of persons and for some age groups than for others. The question to 
be asked, therefore, is one of specific reliabilities: reliable for whom 
and under which conditions? Partial answers have been given to this 
question; but much more research remains to be done. 

Matching. As a method of estimating Rorschach reliability, match- 
ing is regarded by many experts as most satisfactory, since the whole 
Rorschach picture is thereby kept intact. This is a reliability tech- 
nique in the sense that it attempts to answer the question: “Do the 
responses have the same meanings for different experts?” Internal con-
sistency of responses, then, is estimated in terms of consistency of
meaning conveyed by the responses; that is, reliability of scoring and
of interpretations. Krugman reports, for example, that three judges
agreed perfectly in matching twenty Rorschach records with interpre-
tations that had been made by others. Also, in this same study it was
found that when two independent interpretations of the twenty re-
sponse records were made, the two agreed essentially in respect to the
significance of about ninety percent of the response data and showed
partial agreement on ten percent. In another instance when twenty-
five records were matched with interpretations, using six judges, the
average coefficient of contingency was .87.24

24 J. I. Krugman, “A Clinical Validation of the Rorschach with Problem
Children,” Rorschach Research Exchange, Vol. 6, 1942, pp. 61-70. Actual
percentages were: essential agreement, 89.6%; partial agreement, 10%; dis-
agreement, 0.4%. The coefficient of contingency is an index of correlation
which is used when both variables are classified in two or more categories
rather than being given quantitatively in the form of a continuous or a discrete
variable. See also J. O. Palmer, “A Dual Approach to Rorschach Validation:
A Methodological Study,” Psychological Monographs, No. 325, 1951.
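
The coefficient of contingency described in the footnote can be computed from a table of frequencies. The sketch below assumes Pearson's form, C = sqrt(chi-square / (chi-square + N)); the text does not state which formula was used in the study cited, and the function name and the table of counts are invented for illustration.

```python
from math import sqrt


def contingency_coefficient(table):
    """Pearson's coefficient of contingency for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("the table contains no observations")
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if the two classifications were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi_square += (observed - expected) ** 2 / expected
    return sqrt(chi_square / (chi_square + n))


# Hypothetical table: rows are the categories assigned by one judge,
# columns the categories assigned by another, for thirty records.
counts = [
    [8, 1, 1],
    [1, 8, 1],
    [1, 1, 8],
]
c = contingency_coefficient(counts)
print(round(c, 2))
```

In this form the coefficient cannot reach 1.00; for a square table of k categories its upper limit is sqrt((k - 1) / k), so that values from tables of different sizes are not directly comparable.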


Attempts to Mislead. The remaining approach, attempts to falsify
responses, appears to be neither an appropriate nor a reasonable
method of estimating Rorschach reliability. Some psychologists have
called this method “testing the limits of reliability.” The results on al-
most any type of testing device can be more or less distorted by
testees, depending upon their intelligence and degree of psychological
sophistication. When this method is used, the experimenter is not
actually trying to add evidence to the problem of reliability. He is
seeking an answer to the question, “Is the Rorschach test resistant to
a person’s attempts to falsify responses and to misrepresent his per-
sonality?” The answer is that some changes in the patterns of re-
sponse can be effected through deliberate effort; but the amounts and
directions of change depend upon several factors: namely, relative
personality maturity and versatility to begin with, normality versus


maladjustment, intellectual level, amount of knowledge about and 
definiteness of “set” 




at experimental 
a good or poor im- 
Ses, in altering their 
S, using the several 
© an appreciable 
nly two other subjects to 


nal consistency and retest after a relatively brief interval. When
scoring categories and elements within them are isolated, some
functions are found to be more stable than others. The instrument’s
reliability appears to be greater when the “global” or patterned
character of response records is studied than when scoring
categories are isolated. In evaluating the findings of reliability studies
on the Rorschach, it is essential to bear in mind the complexity of the
instrument, the subtleties of interpretation, and the influence of ex-
traneous factors (e.g., “set,” motivation, verbal handicaps, previous
experience, attitude toward examiner) upon perception. An indi-
vidual’s responses in a test situation are not the product of only his
“constant” personality traits reacting to constant test materials.

25 For example, M. L. Hutt et al., “The Effect of Varied Experimental
‘Sets’ upon Rorschach Test Performance,” Journal of Projective Techniques,
Vol. 14, 1950, pp. 181-186; R. G. Gibby, “The Stability of Certain
Rorschach Variables under Conditions of Experimentally Induced Sets:
I. The Intellectual Variables,” ibid., Vol. 15, 1951, pp. 3-26.

26 A. L. Carp and A. R. Shavzin, “The Susceptibility to Falsification of the
Rorschach Psychodiagnostic Technique,” Journal of Consulting Psychology,
Vol. 14, 1950, pp. 230-233.

From the data at present available, we must conclude that re- 
liability of the Rorschach test has not been unequivocally demon- 
strated when the usual methods and standards are used, but that the 
evidence on the whole is more favorable than unfavorable. It may be, 
however, that the usual conception and criteria of reliability (such 
as applied to tests of intelligence and specific aptitudes) are not and 
should not be applicable to an instrument like the Rorschach, which 
is unstructured and thus permits variations in responses, which is not 
readily quantified, and which has been employed largely to test and 
describe personalities who are maladjusted or in a fluid state. Perhaps 
the more significant result is the very high percentage of agreement of 
judges in the interpretation of responses. This criterion, however, does
not satisfy the standard definitions of reliability. In regard to the
Rorschach at present, we may say that some researches indicate satis- 
factory reliability, as usually defined, while others do not; but there is 
a high degree of consistency among Rorschach examiners regarding 
interpretation of response data. 


Validity. It is entirely reasonable to ask how we know that the stated 
relationships exist between the scoring categories and the personality 
traits they are said to indicate. How is it known that color responses, 
for example, have the emotional significance attributed to them, or 
that movement responses are related to richness of associations and 
imaginativeness? The answers to these and similar questions are to be 
found in the fact that Rorschach himself devised and used the inkblots 
clinically in the diagnosis of personality and experimentally with per- 
sons of known personality traits. In so doing he found the response 
differences reported. Since the publication of his test in 1921, his 
disciples have also attempted to establish its validity by similar means. 
Known Groups. There are several types of validating studies.27 The
first method, used by Rorschach, is the intercomparison of known
groups differing in intelligence and personality organization. For
the most part Rorschach used mental patients who presented rather
clearly discernible extremes of certain traits. In addition he tested
artists, scholars, persons of average abilities, and some mental de-
ficients. From group to group he and other investigators found sig-
nificant differences in responses and in characteristic patterns. For
example, in the neurosis pattern, the following distinguishing signs,
among others, are some of the most important reported: very few
movement responses, color shock, shading shock, few or no form-
color responses, noncombined form responses constituting fifty per-
cent or more of the total number of responses, refusals to one or more
cards, small total number of responses. This pattern is interpreted as
signifying lack of social adaptability, excessively rigid control, sup-
pression of spontaneity and originality, and anxiety. The value of this
method of validation depends, in the first place, upon the accuracy of
the clinical diagnoses of the subjects whose Rorschach records are
compared with their clinical classification (based upon their case his-
tories). Results thus obtained from numerous clinical reports are con-
sidered reasonably satisfactory.28

27 Cf. M. R. Hertz, “Validity of the Rorschach Method,” American Journal
of Orthopsychiatry, Vol. 11, 1941, pp. 512-519.


Since Rorschach’s original work, many and varied validating meth-
ods have been used. These may be classified as follows:

1. Rorschach diagnoses compared with diagnoses by psycho-
therapists and clinical interviewers

2. Rorschach findings compared with consistent observation of
behavior over an adequate period of time

3. Matching Rorschach interpretations with clinical case reports

4. Comparing Rorschach protocols obtained before and after
therapy

5. Single Rorschach variables, or a combination of a few, re-
lated to observed aspects of behavior

6. Experimental validation: (a) influencing the subjects; (b)
varying the stimulus; (c) relation of Rorschach responses to physi-
ological activity

Each method will be briefly described and general results briefly
noted.

28 A detailed summary and a bibliography are in Bell, op. cit., pp. 138-149.
More recent selected references are included in subsequent footnotes in this
chapter.


Comparisons of Diagnoses. Individuals are examined and diag- 
nosed with the Rorschach. Also, another staff member makes a 
psychiatric diagnosis after interviews. Findings are compared. In one 
study of 26 children referred to a clinic, the two sets of diagnoses 
agreed in 62 percent of the cases before psychotherapy. A year later, 
Rorschach and psychiatric diagnoses agreed in 89 percent of the
cases, all the shifts having been made in the psychiatric classifica-
tion.29 Although not all reports are as favorable as this one, the re-
sults in general are reasonably satisfactory, sufficiently so that the
Rorschach is widely used in clinics as a diagnostic aid. 
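
The percentages of agreement reported in comparisons of this kind are simple proportions of cases in which two independent classifications coincide. A minimal sketch follows; the function name, the diagnostic labels, and the eight cases are hypothetical and are not taken from the study cited.

```python
def percent_agreement(first_diagnoses, second_diagnoses):
    """Percentage of cases on which two lists of diagnoses agree."""
    if len(first_diagnoses) != len(second_diagnoses) or not first_diagnoses:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(first_diagnoses, second_diagnoses))
    return 100.0 * matches / len(first_diagnoses)


# Invented classifications of eight cases by a Rorschach examiner
# and, independently, by a psychiatric interviewer.
rorschach = ["neurotic", "normal", "psychotic", "neurotic",
             "normal", "neurotic", "psychotic", "normal"]
psychiatric = ["neurotic", "normal", "neurotic", "neurotic",
               "psychotic", "neurotic", "psychotic", "neurotic"]
agreement = percent_agreement(rorschach, psychiatric)
print(agreement)
```

A percentage of this kind takes no account of the agreement that would occur by chance when a few categories are used very often, which is one reason it should be read together with the number of cases and categories involved.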

Comparisons with Observations. Behavioral observations of se-
lected individuals are made continuously over a period of time by
qualified persons. The subjects observed are usually maladjusted
children or adults diagnosed as pathological, or non-clinical school
children. These observations are compared with Rorschach findings in
respect to certain aspects of personality, such as intellectual function-
ing, anxiety, emotional expression, etc. The observations may be
made in a camp for children, a community or recreation center, in a
school, and the like. The results of studies of this type have provided
only partial validation.30 In one of the more favorable reports in this
area of validation teachers’ descriptions and ratings of personalities of
thirty children (non-clinical cases) were matched with the Rorschach
responses of these subjects. The correspondence found was at the
one-percent level of significance (which means that a correspondence
at least this close would be expected by chance only once in a hundred
instances).
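
What a chance level means in a matching experiment can be shown by direct computation. The sketch below assumes the simplest design, in which a judge pairs n descriptions one-to-one with n records, and finds the probability of k or more correct pairings under purely random matching. The design, the function names, and the numbers are assumptions for illustration; the procedure actually followed in the study cited is not described here.

```python
from math import comb, factorial


def derangements(m):
    """Number of ways to arrange m items so that none is in its own place."""
    if m == 0:
        return 1
    previous, current = 1, 0  # values for 0 items and for 1 item
    for i in range(2, m + 1):
        previous, current = current, (i - 1) * (previous + current)
    return current


def prob_at_least(n, k):
    """Chance probability of k or more correct pairings out of n."""
    # Choose which j pairings are correct; the remaining n - j must all be wrong.
    favorable = sum(comb(n, j) * derangements(n - j) for j in range(k, n + 1))
    return favorable / factorial(n)


# With five descriptions and five records (hypothetical numbers):
p_three_or_more = prob_at_least(5, 3)
p_four_or_more = prob_at_least(5, 4)
print(round(p_three_or_more, 4), round(p_four_or_more, 4))
```

Under random pairing the expected number of correct matches is one, whatever the number of records, so that even a modest number of correct pairings may fall well beyond the one-percent level.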

Matching. The most frequently used method of studying validity 
involves comparisons of Rorschach reports with clinical case reports. 
Some of the studies have used deviant personalities; others have used 


29 M. G. Siegel, “The Diagnostic and Prognostic Validity of the Rorschach
Test in a Child Guidance Clinic,” American Journal of Orthopsychiatry, Vol.
18, 1948, pp. 119-133. Also J. D. Benjamin and F. G. Ebaugh, “The Diagnostic
Validity of the Rorschach Test,” American Journal of Psychiatry, Vol. 94,
1938, pp. 1163-1178; M. R. Hertz and B. B. Rubinstein, “A Comparison of
Three ‘Blind’ Rorschach Analyses,” ibid., Vol. 9, 1939, pp. 295-314.

30 R. A. Young and S. A. Higginbotham, “Behavior Checks on the Rorschach
Method,” American Journal of Orthopsychiatry, Vol. 12, 1942, pp. 87-94;
J. L. Singer and H. E. Spohn, “Some Behavioral Correlates of Rorschach’s
Experience-Type,” Journal of Consulting Psychology, Vol. 18, 1954, pp. 1-9;
J. W. Swift, “Matchings of Teachers’ Descriptions and Rorschach Analyses of
Preschool Children,” Child Development, Vol. 15, 1944, pp. 217-224.


rea is Krugman’s.* Rorschach a 
en were matched by 5 judges with 
125 pairings, a contingency coef- 
udges, the average of correct match- 


regard to each of the following:
tional aspects; (3) diagnosis; (4) 
ent was found in 73 percent of the 
comparisons, fair agreement in 21 percent, and slight agreement in 
the remainder. These are highly favorable findings. 


Changes after Treatment. When Rorschach reports obtained be-
fore and after various forms of treatment are compared, as a method
of validating the test, the hypothesis is that the test record should re-
flect the personality changes that have taken place in the interim as a
result of therapy. Two studies, one involving psychoanalysis and the
other involving insulin treatment, will be mentioned.32 In the first, 36
persons were involved, for all of whom there were significant changes
as between the “before” and the “after” Rorschach records; and these
changes were related to the trends reported by the therapist: for ex-
ample, improved emotional control, improved intellectual function-
ing. On 14 out-patient subjects, the Rorschach examiner and the
therapist were in essential agreement regarding improvement. In the
case of 22 hospitalized patients, the two clinicians agreed on 10 in
regard to direction and degree of improvement; for the remaining 12,
the therapist reported “social recovery” (a nebulous matter), whereas
their Rorschach responses showed no improvement.

When insulin treatment was used on schizophrenics, those who
showed behavioral improvement also showed improved performance
on the Rorschach: namely, increase in speed of reaction and of re-
sponse, improved verbal form and logical content, clearer perceptions
and more relevant responses, and improved emotional control.

31 J. I. Krugman, “A Clinical Validation of the Rorschach with Problem
Children,” Rorschach Research Exchange, Vol. 6, 1942, pp. 61-70. Also, W.
Goldfarb, “Rorschach Test Differences between Family-reared, Institution-
reared, and Schizophrenic Children,” American Journal of Orthopsychiatry,
Vol. 19, 1949, pp. 624-633.

32 M. J. Rioch, “The Use of the Rorschach Test in the Assessment of Change
in Patients under Psychotherapy,” Psychiatry, Vol. 12, 1949, pp. 427-434;
Z. A. Piotrowski, “Rorschach Manifestations of Improvement in Insulin
Treated Schizophrenics,” Psychosomatic Medicine, Vol. 1, 1939, pp. 508-526;
C. Windle, “Psychological Tests in Psychopathological Prognosis,”
Psychological Bulletin, Vol. 49, 1952, pp. 451-482.
Separate Variables Related to Behavior. Without going into the
details and tentative findings in this area of validation, we shall indi-
cate the kinds of studies undertaken and the general conclusions. This
is called the “molecular” approach because, instead of making a
“molar” or “global” interpretation, only one or a few Rorschach
variables are studied in relationship to selected aspects of behavior.
The Rorschach variables most frequently investigated have been color,
movement, and form responses as they are related to certain aspects
of behavior, such as introversion-extraversion (as measured by a
personality inventory) and intellectual capacity (using a test of intel-
ligence).33 On the whole, the results in this area have been inconclu-
sive, since the functions being measured by the Rorschach and by the
criteria are not necessarily identical or similar.

Experimental Validation. This method takes several forms. The
subjects used may be influenced through experimentally induced ten-
sions, through hypnosis or drugs, through brain-surgery, or through
electric shock treatment. When the Rorschach test is administered “be-
fore” and “after,” it is possible to estimate the effects of the changed
conditions. The general conclusion is that the findings are favorable as
indications of validity.34 There are two principal obstacles to the em-
ployment of this method: first, the number of subjects is necessarily
small in each study, due to the great amount of time required by each
case and to the relative difficulty of obtaining subjects; and, second,
the paucity of scientific information as to organic changes in subjects
treated by means of surgery, drugs, or electric shock.

Experimental studies in which the stimulus has been varied have
been concerned almost entirely with the influence of color upon re-
sponses to the inkblots.35 The purpose of the experiments was to assess
the stimulus value of color in the cards and the hypothesis of “color
shock.” Although the results have not been entirely consistent, these
researches have raised serious doubts concerning the earlier concep-
tions of the role of color in this test. In response to these largely nega-
tive results, some clinicians point out that, while the reported data are
statistically relevant to the isolated stimulus of color, they are not
necessarily relevant to the interpretation of color responses in the
context of the whole. They emphasize, also, that the statistical find-
ings refer to group trends but do not negate the fact that in some cases
the color of the cards has a disruptive influence.36

The third experimental approach is to study the relationships be-
tween Rorschach responses and changes in physiological activity. For
example: what is the relationship between “color shock” responses
and galvanometric response? What are the relations among experi-
mentally induced stress, cardiovascular activity, and Rorschach
responses? Thus far, the relatively few available experimental reports
have demonstrated that there is a degree of association between
autonomic responsiveness and Rorschach data.37

33 For example, R. W. Gardner, “Impulsivity as Indicated by Rorschach
Test Factors,” Journal of Consulting Psychology, Vol. 15, 1951, pp. 464-468;
J. Wishner, “Rorschach Intellectual Indicators in Neurotics,” American Journal
of Orthopsychiatry, Vol. 18, 1948, pp. 265-279.

34 For example, E. Stainbrook, “The Rorschach Description of Immediate
Post-convulsive Mental Function,” Character and Personality, Vol. 12, 1944,
pp. 302-322; M. Williams, “An Experimental Study of Intellectual Control
Under Stress and Associated Rorschach Factors,” Journal of Consulting
Psychology, Vol. 11, 1947, pp. 21-29.
Summary. Although specific methods have dif-
fered in detail, they are essentially of two broad types: experimental
and matching, the latter having been the more widely used and the

ghtly from the other two: (1) inter-
comparisons of Rorschach responses of known groups; (2) “blind”
diagnoses of individual cases, followed by comparisons with diagnoses
matching, of an

35 For example, J. A. Perlman, “Color and the Validity of the Rorschach
8-9-10 Percent,” Journal of Consulting Psychology, Vol. 15, 1951, pp. 122-
126; H. Sanderson, “Norms for Shock in the Rorschach,” ibid., pp. 127-129.

36 For example, E. M. Siipola, “The Influence of Color on Reactions to Ink
Blots,” Journal of Personality, Vol. 18, 1950.

37 D. Brower, “The Relation between Certain Rorschach Factors and
Cardiovascular Activity before and after Visuo-motor Conflict,” Journal of
General Psychology, Vol. 37, 1947, pp. 93-95; M. R. Hertz, “Current Problems
in Rorschach Theory and Technique,” Journal of Projective Techniques, Vol.
15, 1951, pp. 307-338.


individual's Rorschach record with extensive clinical study and diag- 
nosis, to determine points of agreement and disagreement. If valid, 
then, the inkblot test can be used to obtain a diagnosis in much less 
time, or to supplement or confirm a diagnosis arrived at by other 
means. Rorschach experts maintain that innumerable investigations 
support the clinical value of their instrument. They have found, 
furthermore, that results obtained by the several methods of valida- 
tion indicate the Rorschach test to be useful in revealing threatening 
or unwholesome trends in personality development before very serious 
difficulties actually appear. If this forecasting power of the instrument 
can be more definitely established, the Rorschach then will become 
especially valuable for purposes of mental hygiene and preventive 
psychological treatment. For more definite determination of its fore-
casting quality, however, an adequate number of longitudinal studies
over an appreciable period of the subjects’ lives will be necessary.
These are as yet not available, for their achievement is beset by many
difficulties.


Group Methods. It was to be expected that sooner or later, efforts
would be made to devise methods of group administration of Ror-
schach’s individual test, as was the case with intelligence tests. Several
techniques have been suggested and are now finding application in
several practical fields while, at the same time, being subjected to
further experimentation.

In one instance, the Rorschach blots, on slides, are projected on
a screen, each for a specified time, the subjects being required to
write out their responses. They are later asked to mark the blot re-
productions on their blanks and to answer a series of questions so
that their responses may be scored according to the standard cate-
gories.38
A second method differs from the preceding in that the subjects
are provided with a list of responses for each blot from which to
choose (multiple choice). A modification of the multiple choice
item has been suggested: namely, that the several responses (some
“normal” and some “neurotic”) be rated by the subjects themselves
in order of applicability or likeness to the particular blot to which
they are attached. The sum of the ordinal values assigned to the
“neurotic” responses constitutes the score for each blot. Higher scores
are, of course, the more favorable since “neurotic” responses are less
desirable and should be rated lower in the order of applicability.39

38 M. R. Harrower-Erickson, Large Scale Rorschach Techniques: A Manual
for the Group Rorschach and Multiple Choice Test, Springfield, Ill.: C. C.
Thomas, 1945.

Another group procedure being explored is to have the subjects
self-administer the test, following prepared instructions.40

From studies thus far reported, it appears that the group methods
of Rorschach testing are less successful than the individual method,
which permits free, unlimited response, and during which the ex-
aminer is able to make observations of the subject’s behavior and at-
titudes. In several fields, how
methods have been reported:
successful workers, comparative studies be-
tween racial groups in the United States, and in studying personalities
of artists and scientists.41


i a 
cessarily conform 
aPping of scores and characteris 


n, there are some research data which 
tA few Studies report only limited 
thods in fields of application such as 
can be demonstrated that the ex- 

cal analyses are seriously defective in 

one group of reports or the other—and this does not appear to be the 


39 H. J. Eysenck, “A Comparative Study of Four Screening Tests for
Neurotics,” Psychological Bulletin, Vol. 42, 1945.

40 R. L. Munroe, “An Experiment with a Self-Administering Form of the
Rorschach and Group Administration by Examiners without Training,”
Rorschach Research Exchange, Vol. 10, 1946.

41 For example, M. E. Steiner, The Psychologist in Industry, Springfield,
Ill.: C. C. Thomas, 1949; A. Roe, “Psychological Examinations of Eminent
Biologists,” Journal of Consulting Psychology, Vol. 13, 1949, pp. 225-246.

42 For example, R. G. Anderson, “Rorschach Test Results and Efficiency
Ratings of Machinists,” Personnel Psychology, Vol. 2, 1949, pp. 513-524;
A. K. Kurtz, “A Research Test of the Rorschach Test,” ibid., Vol. 1, 1948,
pp. 41-51.


case—then the problem is to determine the reasons for the dis- 
crepancies in results. Why are group methods successful in some 
situations but not in others? Further detailed research is necessary to 
answer this question. Such research may reveal not only defects in
the group Rorschach but also, for example, that “successful” persons
in some occupations may represent a wide variety of personality
patterns, or that the reliabilities of the criteria (e.g., foreman’s
ratings, ratings by deans of students) are too low. In the meantime,
it can be said that group Rorschach methods have been used
effectively by some of the specialists in the field.


Evaluation of the Rorschach Test. The Rorschach inkblot method 
has been shown to have its greatest usefulness in revealing markedly 
deviant personalities. Its value in differentiating among individuals 
within the large groups in the middle ground between extremes is 
limited. This, however, is more or less true of all psychological tests, 
for two main reasons: differences between individuals within the 
middle groups are not so pronounced, hence they are more difficult 
to measure or assess; the instruments are not sufficiently refined to 
detect the finer differences. There is a third possibility: namely, im- 
proved methods of scoring categories, improved interrelationships and 
sounder interpretations can add to the value of the instrument. 

It is pertinent to add, at this point, that often the Rorschach test 
and Rorschach specialists are asked by critics to demonstrate levels 
of achievement and of clinical and predictive value such as are not 
imposed upon non-projective devices, nor, for that matter, upon non-
psychological clinical techniques.

Enthusiastic Rorschach exponents recognize that the test is in
need of further clinical and experimental research, particularly in re-
spect to scoring, normative studies regarding a variety of factors
(e.g., age, sex, cultural and economic status), and longitudinal stud-
ies.43


43 To score for maladjustment, R. L. Munroe has suggested the use of a list 
of indicators of a kind which would give greater objectivity to scores. See “The 
Inspection Technique: A Method for Rapid Evaluation of the Rorschach 
Protocol,” Rorschach Research Exchange, Vol. 9, 1945, pp. 46-70. For a 
suggested quantification and standardization of Rorschach scores, see C. 
Buhler, K. Buhler, and D. W. Lefever, Development of the Basic Rorschach 
Score, with Manual of Directions, 1948 (published by the authors, in mimeo- 
graphed form, University of Southern California). 




Adverse criticism of the Rorschach, from others than its exponents, 
has not been lacking. Critics have dwelt on its inadequate objectivity, 
reliance on personal norms, limited validity, restriction to clinical use, 
and even “cultism.” Although these criticisms are warranted to some 
degree, the fact is that Rorschach exponents themselves have not been 
unaware of the problems and the unanswered questions; for the litera- 
ture is replete with their critical publications on the test’s reliability, 
validity, clinical usefulness, guidance value, case studies, and predic- 
tive value. With such an approach to the test, it was inevitable that 
progress in the last decade, both scientific and clinical, should be 
considerable in the application of the Rorschach to the psychological
study of personality and perception and to understanding malad-
justed persons.44


M. L. Hutt has stated the matter neatly and reasonably: “Clinical
practice is, at present, an art. Art, I believe, will always remain an
integral part of clinical practice. However, the scientific aspects of
clinical work will, in time, become a major portion of this practice as
theory, technique and criteria reach a more mature state of psycho-
logical development. Meanwhile, I frankly admit the importance of
subjective norms, clinical judgment, and subtle influences of intui-
tion. These . . . must be given full play in clinical work in reaching
working hypotheses about the [person]. At the same time I rely as
much as I can upon all the scientific . . . and utilize these in
refining my hunches. . . . I test these clinical hunches against
biographical data, clinical behavior, and all other evidence
accumulated about the [person]. In short,
I recognize that as a clinician I have two roles to play: the artist and
the scientist. I use the former in getting to know the [person] and use
the latter to correct my impressions as well as I can. . . . Each role
has its place and . . . we must be careful not to confuse the two.” 45


44 See the following for a detailed
Heimann, “Development and Applicati
Chapter 5 in Educational and Psychologic
search, Vol. 23, No. 1, 1953. Also, H. Sargent, “Projective Methods: Their
Origins, Theory, and Application in Personality Research,” Psychological Bulle-
tin, Vol. 42, 1945, pp. 257-293.

45 “The Assessment of Individual Personality by Projective Tests: Current
Problems,” Journal of Projective Techniques, Vol. 15, 1951, pp. 389-390. See
also F. Halpern, A Clinical Approach to Children’s Rorschachs, New York:
Grune and Stratton, 1953, passim.


THEMATIC APPERCEPTION TEST 46

Description and Procedure. Commonly referred to as the 
TAT, this projective method consists of thirty pictures plus one blank 
card. The cards are used in various combinations, depending upon sex 
and age. Some cards are used with all subjects, while others are used 
with only one sex- or age-group. The total number of pictures used 
with any subject is twenty, usually administered in two sessions, ten 
each time. 

The person being examined, according to Murray’s instructions, 
is told that this is a test of imagination, that he is to make up stories 
to suit himself, and that there is no right and no wrong response. 
The pictures are shown one at a time, accompanied by simple in- 
structions. The subject is informed that each card shows a scene. He 
is asked: (1) to tell what he thinks led up to the depicted scene; 
how it came about; (2) to give an account of what is happening 
and the feelings of the characters in the picture; (3) to tell what 
the outcome will be. There are no time limits; in fact, the subject 
is encouraged to continue for as long as five minutes on a picture. 
The account should be recorded verbatim if possible.⁴⁷ It is
recommended that the testing be followed with an interview to learn the
origins of the stories, seeking especially associations to places,
names of persons, dates, specific and unusual information. This is an
important aspect of the process because it enables the examiner to
clarify meanings of stories and more reliably to evaluate their
significance, since the subject’s accounts are not only a product of
his inner personality traits, but may be a more or less superficial
reflection of cultural forces (radio, movies, comics, current events,
reading materials, etc.). For instance, a girl ten years of age made up
an unexpectedly large number of stories dealing with crime and mystery.
She had been listening regularly to a radio mystery serial. The
frequent or compulsive utilization of recent environmental experiences,
however, is of significance in interpreting a subject's reports, by
virtue of the fact that the person has utilized them as representing a
con[...] on a preconscious level, or as a symbol on an unconscious
level.

46 H. A. Murray, Thematic Apperception Test Manual, Cambridge: Harvard
University Press, 1943. See also his Explorations in Personality,
Oxford: Oxford University Press, 1938. This test was introduced in 1935
by C. D. Morgan and H. A. Murray in “A Method for Investigating
Fantasies: the Thematic Apperception Test,” Archives of Neurology and
Psychiatry, Vol. 34, 1935, pp. 289-306.

47 As in the case of the Rorschach, some specialists in the use of the
TAT have introduced their own variations in giving instructions; but
basically most follow the formula here given.


FIG. 19.3. A Picture from the Thematic Apperception Test. Harvard
University Press. (By permission.)

Although pictures are not unstructured to the same degree as an
inkblot, those used in the TAT series are sufficiently ambiguous that
there is wide latitude for manifestations of individual differences in
responses. The TAT is, like the Rorschach, a projective method, but
there is a fundamental difference between these two. While the latter
is intended to reveal the structure and organization (or
disorganization) of an individual's personality, the former is devised
to bring out primarily the content of one’s personality: the drives,
needs, sentiments, conflicts, complexes, and fantasies. The test is
based upon the principle that when a person interprets an ambiguous
social situation, he is apt to reveal aspects of his own personality
that he otherwise will not admit, or cannot admit because they are
unconscious. The subject, being absorbed in the picture and attempting
to construct an appropriate account of it, is “off his guard” and
becomes much less aware or quite unaware of himself in the situation.
In creating stories based upon equivocal pictures, the individual
utilizes and organizes content of his own personal experiences.
Everything the subject says is regarded as having meaning. From these
stories, the skilled examiner and interpreter draws inferences
regarding the subject's personality traits and their organization.


Analysis of Stories. Interpretation of TAT stories may be made in any
one of several ways, depending upon the viewpoint of the examiner and
the purpose of the testing. But in all instances, the details of an
individual's stories must be viewed against known facts of the
personality being studied. The stories should not be interpreted in
vacuo.

In some cases, rereading the stories several times will reveal the
subject’s basic problems, for repetitive patterns may be found
throughout; or it may be found that facts and aspects of different
stories constitute a meaningful whole. Or one may make a minute
analysis in accordance with the scheme provided by Murray, or in
accordance with modifications thereof. We shall briefly outline two
schemes of analysis so that the reader may see more clearly the purpose
of the TAT. All schemes, however, have this in common: they are
intended to disclose personality content, from which the interpreter
judges personality organization.

Murray recommends that the content of stories be analyzed into:
(1) the forces emanating from the “hero,” and (2) the forces emanating
from the environment. These two divisions are analyzed under the
following six categories.

(1) The hero: the character in each picture with whom the subject
identifies himself; in whom the subject is most interested; whose point
of view, feelings, and motives have been most intimately portrayed.
The heroes are to be characterized by the interpreter according to
their principal, or idiomatic, traits. (E.g., solitariness, leadership,
superiority, criminality, etc.)

(2) Motives, trends, and feelings of the heroes: analysis in detail
of everything each of the heroes feels, thinks, and does; noting
especially the unusual, the high frequencies, the high and low [...].
Under this category, Murray lists numerous variables (traits) which
are scored on a scale from one to five, on the basis of their strength
as expressed through intensity, duration, frequency, and significance
in the plot. (E.g., abasement, achievement, dominance, conflict,
dejection.)

(3) Forces of the environment: the general nature and de[...]
[...]ally human situations, noting especially [...] such forces (more
than thirty) have [...] (E.g., rejection, physical injury, dominance,
lack, loss.) The strengths of these are rated on a scale of one to
five.

(4) Outcomes: the comparative strength of the forces emanating
from the hero and the strength of those from the environment; the
amount of hardship and frustration experienced; relative degrees of
success and failure, happy and unhappy endings.

(5) Themas: interaction of [...] forces, together with the successful
or unsuccessful outcome for the hero, constitute a simple thema.
Combinations or sequences of these are called complex themas. The
thema, simple or complex, is actually a synthesis of the elements
analyzed under the first four categories, the purpose being to view the
several forces in their interrelationships and thus to determine the
most prevalent problems, in a given case, arising from internal needs
and external forces.

(6) Interests and Sentiments: choice of topics and manner of dealing
with them, displayed especially by the positive and negative value or
appeal of various elements in the pictures. (E.g., older women who may
be mother figures, older men as father figures, same or opposite sex.)
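The one-to-five ratings described under categories (2) and (3) lend themselves to a simple tally across stories. What follows is only an illustrative sketch, not Murray's published scoring procedure: the variable names are taken from the examples above, but the ratings, the record layout, and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ratings: one dict per story, mapping a variable
# (need or press) to its rated strength on the one-to-five scale.
story_ratings = [
    {"achievement": 4, "dominance": 2},
    {"achievement": 3, "abasement": 5},
    {"dominance": 1, "achievement": 5},
]


def summarize(ratings):
    """Return {variable: (stories in which it appears, total, mean)}."""
    collected = defaultdict(list)
    for story in ratings:
        for variable, strength in story.items():
            if not 1 <= strength <= 5:
                raise ValueError(f"{variable}: rating {strength} is outside 1-5")
            collected[variable].append(strength)
    return {
        variable: (len(values), sum(values), sum(values) / len(values))
        for variable, values in collected.items()
    }


summary = summarize(story_ratings)
for variable, (count, total, mean) in sorted(summary.items()):
    print(f"{variable}: {count} stories, total {total}, mean {mean:.2f}")
```

A tally of this kind records only frequency and rated strength; the synthesis into themas under category (5) remains a matter of the interpreter's judgment.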


Tomkins, a former associate of Murray’s, has devised his own
scheme of analysis. He justifies and differentiates it from Murray’s in
this way: “Its rationale consists in tapping varying levels of abstraction
in the hope that significant aspects of diverse types of protocols will
be detected by the use of concepts which range from a level of broad
generality to a high degree of differentiation.” Each story is scored
under four main categories: vectors, levels, conditions, qualifiers.⁴⁸


48 S. S. Tomkins, The Thematic Apperception Test, New York: Grune and
Stratton, 1947, Chapter 3. Another and different analytic outline by a
former associate of Murray’s is by L. B. Bellak, A Guide to the
Interpretation of the Thematic Apperception Test, Psychological
Corporation, 1947. See also J. L. Lasaga y Travieso and C.
Martinez-Arango, “Some Suggestions Concerning the Administration and
Interpretation of the TAT,” Journal of Psychology, Vol. 22, 1946,
pp. 117-163.




(1) Vectors: the psychological direction of behavior, drives,
feelings, etc. The vectors, of which ten have been listed, may have as
their objects any thing or person or idea of human interest. Vector
means a field of force, or magnitude and direction of force. (E.g.,
the vector “against,” to attack objects; the vector “toward,” to
approach or enjoy objects.)

(2) Levels: the plane of psychological function involved in the
story, seventeen being listed. (E.g., object description, intention,
wish, night dreams.)

(3) Conditions: any psychological, social, or physical state which
is not in itself behavior, striving, or wish; that is, conditional
qualities of behavior. Two major divisions have been made: (a) states
with negative factors or forces (called valences), such as lacks, loss,
danger, inner conditions (depression, anxiety); (b) states with
positive or neutral factors, such as abundance, security, moderation,
inner conditions (optimism, certainty).

(4) Qualifiers: the specific aspects of the first three categories:
namely, temporal characteristics (past, present, future, duration of an
episode); contingency (degree of certainty); intensity (strength of
items in story); negation (any type of denial); subsidiation (any
means-end relationship); causality (any causal relationship).

Any word or statement in the subject's stories may, of course, be
classified under one or more of the major categories and their
subdivisions, as given above. This method of analysis is exceedingly
detailed and laborious; but its author urges that it often yields
insights not otherwise obtainable.

Comparison of the two foregoing analytic outlines, and comparison 
of these with other schemes, make it obvious that the TAT is not an 
objective test in the sense that tests of intelligence and specific aptitude 
are objective; first, because the details of the stories are largely evalu- 
ated and classified rather than scored; and second, because there are 
as yet no uniform standards or criteria which can be objectively
applied for evaluating or scoring. We are not maintaining that an
instrument like the TAT should or can conform to the standards and
criteria of objectivity in scoring and reporting employed with other
types of psychological tests. But the fact that the TAT does not so
conform explains why its interpreters report their results in somewhat
different terms and why its use requires acute psychological insights
on the part of interpreters, particularly into the psychology of human
needs and motivation. It is to be hoped, however, that before too long,
after evaluations of the effectiveness of the several analytic schemes
in actual practice have been made, a single plan of analysis will be
achieved.

Regardless of the particular scheme used in making an analysis, the 
results are interpreted as representing, literally or symbolically, tend- 
encies and traits of the subject’s personality, belonging to his past or 
present, or projected into the future. The results are interpreted, also, 
as representing, literally or symbolically, effective forces in the sub- 
ject’s environment, his views of the world, his past experiences, his 
anticipations of the future. The results and conclusions reached by any 
analysis are to be used as an hypothesis to be checked against other
sources of information and as a starting point for further psychological
interview, counseling, or treatment.


Reliability.⁴⁹ TAT reliability has been studied in three ways:

1. extent of agreement among interpreters of the same stories in
regard to traits of the persons examined;

2. similarities between stories on repeated examinations of the
same persons;

3. split-half method, correlating frequency and intensity of needs
expressed in the stories.


Agreement among Interpreters. The first of these methods is
dependent, in part, upon the reliability of the interpreters. Published
reports indicate that agreement among interpreters is greater if they
have had similar backgrounds in regard to training and if they use
similar systems of analysis and scoring. This situation, of course, is
only as it should be; namely, the interpreters approximate uniformity
in their criteria of analysis and ratings. Such a situation is an
approximation to the prescribed and uniform scoring standards and
criteria provided for tests of intelligence and of specific aptitudes.


49 See, as examples: R. Clark, A Method of Administering and Evaluating
the Thematic Apperception Test in Group Situations, Genetic Psychology
Monograph, No. 30, 1944; A. W. Combs, “The Validity and Reliability of
Interpretation from Autobiography and Thematic Apperception Test,”
Journal of Clinical Psychology, Vol. 2, 1946, pp. 240-247; R. Harrison
and J. B. Rotter, “A Note on the Reliability of the Thematic
Apperception Test,” Journal of Abnormal and Social Psychology, Vol. 40,
1945, pp. 97-99; S. L. Garfield and L. D. Eron, “Interpreting Mood and
Activity in TAT Stories,” Journal of Abnormal and Social Psychology,
Vol. 43, 1948, pp. 338-345; M. Mayman and B. Kutner, “Reliability in
Analyzing TAT Stories,” ibid., Vol. 42, 1947, pp. 365-368.

Studies of agreement among interpreters, using for the most part
rank-order correlation and the coefficient of contingency, have
reported coefficients ranging from approximately +.30 to +.90.
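For readers who wish to see how the two indexes just named are computed, a minimal sketch follows. It assumes two interpreters have rated the same subjects; the ratings, the category labels, and the function names are hypothetical and are not drawn from any of the studies cited. The rank-order coefficient is obtained here as the product-moment correlation of average ranks, and the coefficient of contingency as C = sqrt(chi² / (chi² + N)).

```python
import math
from collections import Counter


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_order_correlation(x, y):
    """Spearman's rank-order coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


def contingency_coefficient(a, b):
    """Pearson's coefficient of contingency for two categorical ratings."""
    n = len(a)
    joint = Counter(zip(a, b))
    rows, cols = Counter(a), Counter(b)
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + n))


# Hypothetical ratings of five subjects by two interpreters.
interpreter_a = [1, 2, 3, 4, 5]
interpreter_b = [2, 1, 3, 5, 4]
rho = rank_order_correlation(interpreter_a, interpreter_b)

# Hypothetical two-category classifications of four subjects.
labels_a = ["high", "high", "low", "low"]
labels_b = ["high", "high", "low", "low"]
c = contingency_coefficient(labels_a, labels_b)
```

It should be noted that the coefficient of contingency cannot reach 1.00; with two categories on each side its upper limit is about .71, which must be borne in mind when coefficients obtained by the two methods are compared.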

When percentage of agreement among interpreters was the method
used, clinicians have agreed completely on interpretations of from 50
to about 75 percent of the stories. In addition, there was essential
(though not detailed) agreement in from 10 to 25 percent. On the
remaining stories there was only partial agreement.
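The percentage method amounts to a simple count. The sketch below assumes that each story has already been judged as showing complete, essential, or partial agreement; the twenty judgments are invented for illustration and merely fall within the ranges reported above.

```python
from collections import Counter

# Hypothetical judgments, one per story, of how closely two
# clinicians' interpretations agreed.
judgments = ["complete"] * 12 + ["essential"] * 4 + ["partial"] * 4


def agreement_percentages(labels):
    """Return the percentage of stories falling at each level of agreement."""
    counts = Counter(labels)
    return {level: 100 * count / len(labels) for level, count in counts.items()}


percentages = agreement_percentages(judgments)
for level, percent in percentages.items():
    print(f"{level}: {percent:.0f} percent")
```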

The foregoing data reflect, in part, differences in systems of analysis
and differences in ability and experience of interpreters. The results
of these researches are influenced also by the complexity and often
intangibility of elements in the stories. Considering the nature of the
problem, the reliability results obtained by this method are very
encouraging.

Test-retest. Reliability data obtained by means of the test-retest
method are affected by the stability of the personalities being
examined and by personality changes as a function of time. The greater
the time interval, the lower reliability we may expect, because there
will be more opportunity for the influence of intervening forces. In
view of the fact that it is to be expected that personalities often
will change with the passage of time as developmental conditions
change, especially in the case of children, of adolescents, and of
persons who are clinical subjects, the more significant reliability
data are those based upon degree of agreement of judges and those of
test-retest after a brief interval. Tomkins⁵⁰ reports a reliability
coefficient of +.80 after an interval of two months, using fifteen
young women as subjects; +.60 after an interval of six months, using a
different group of fifteen comparable subjects; and +.50 after ten
months for a third group. While these last two coefficients indicate
that there were significant changes in ratings of some subjects, the
indexes represent group trends, and do not signify that all subjects
showed important changes. In fact, it was found that the time interval
between tests had little or no effect upon the ratings of stable
personalities.

50 Tomkins, op. cit., pp. 6-8.

51 R. N. Sanford et al., Physique, Personality, and Scholarship,
Monograph of the Society for Research in Child Development, No. 8,
1943, p. 263.

Split-half. Using the split-half method, Sanford⁵¹ obtained
reliability coefficients of .48 and .46. The responses were quantified
by analyzing the stories for frequencies and rating intensities of
“needs” and “press” elicited by the pictures. As reliability
coefficients, these are ordinarily too low to be of considerable
significance. Under the circumstances, however, it is a matter of
surprise that the coefficients are even this high, because the
split-half method is inappropriate for the TAT. The method is
inappropriate because not all of the pictures necessarily assess the
same “needs” and “press,” nor are they intended to do so. Although
several pictures may to some extent elicit the same or similar “needs”
and “press,” each is intended to have a different stimulus value. The
split-half method has been virtually abandoned in the study of TAT
reliability.
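The split-half procedure itself may be sketched as follows, under stated assumptions: each subject has one numerical rating per picture, the pictures are divided into alternate halves, and the two half-scores are correlated. The ratings and function names are hypothetical. The Spearman-Brown formula is included because it is the customary step from a half-test correlation to an estimate for the full test; whether the coefficients reported above were so corrected is not stated here.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half(scores):
    """scores: one list per subject, holding one rating per picture."""
    first_half = [sum(subject[0::2]) for subject in scores]   # pictures 1, 3, ...
    second_half = [sum(subject[1::2]) for subject in scores]  # pictures 2, 4, ...
    r_half = pearson(first_half, second_half)
    return r_half, spearman_brown(r_half)


# Hypothetical ratings of one need for four subjects on four pictures.
ratings = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 4],
    [5, 4, 4, 5],
]
r_half, r_full = split_half(ratings)
print(f"half-test r = {r_half:.2f}, stepped-up estimate = {r_full:.2f}")
```

The procedure presupposes that the two halves measure the same thing, which is precisely the assumption the TAT pictures are not designed to satisfy.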

Tomkins reported results obtained with one person who was studied 
intensively over a period of ten months, from whom about four hun- 
dred stories were obtained, using TAT pictures and others also. One 
interpreter rated the traits revealed in half the stories, while another 
rated the other half. The obtained correlation coefficient was +.91. 

Of the several methods used in studying reliability of the TAT, 
interpreter agreement is the most useful and most significant; for in 
employing a projective test, what we want to learn primarily and as 
reliably as possible is the content and organization of a personality as 
it is at the time of examination. Whether that personality will remain 
the same and whether we can predict what that personality will be in 
time will depend upon the degree of stability of the individual being 
tested and upon the nature of the factors in his environment.


Validity. Essentially, several methods of studying TAT validity have
been employed:

1. comparison with past histories and/or with results obtained
through an intensive case study employing a variety of techniques;

2. comparison of characteristics of known individuals or groups
with their TAT records;

3. comparison of TAT findings with other clinical materials: the
subject’s Rorschach record, dreams, or psychoanalytic interpretations;

4. experimentally produced changes.

Since most of the studies reported below do not fall exclusively in one
category or another, they are not being presented according to the
foregoing numerical classification.

Matching. Of these, the first method has been explored more than
the others, Harrison’s studies being regarded as among the most
significant.⁵² Forty patients at a mental hospital were given the TAT,
without any prior knowledge, on the part of the examiner, regarding
[...]. On the basis of their stories and behavior during the testing,
Harrison drew his inferences concerning the personality development,
traits, attitudes, level of intelligence, personal problems, and
conflicts of each subject. These inferences were checked by another
person against the hospital records. A correlation of +.78 was
obtained between estimated and obtained IQ's; 82 percent of the
inferences were correct; in somewhat over 75 percent of the cases,
diagnostic classification was correct, using the major categories; when
18 cases were classified into clinical subgroups, the percentage of
agreement with clinical classification was 67. In order to eliminate
inferences drawn from observation of the subject’s behavior during the
testing session, Harrison had another examiner give the TAT to 15
patients; then he made a “blind” analysis himself. In this instance, his
inferences were 74 percent correct, when compared with already known
biographical and personality data. This drop in correspondence (from 82
to 74 percent) indicates the value of using behavioral observations in
conjunction with projective-test results.
In another and unusual matching experiment, an adaptation of the
TAT was administered to groups of Navaho and Hopi Indians.⁵³ On the
basis of their stories, “blind” interpretations were made of the
personality traits of the people of these two cultures. Anthropologists
familiar with the two Indian societies found the personality analyses
based upon TAT results to be in essential agreement with their facts
regarding these North American Indian cultures.

Preliminary investigation in India, using a modified and adapted
form of the TAT, yielded highly promising results regarding the test's
applicability to the study of social problems in that culture. The
results make a positive contribution also to the question of validity;
for a very high percentage of the stories dealt with the basic problems
of survival (Murray's “succorance”), intra-family relationships
(between the wife and the family of the husband in particular; Murray's
“submission”), and the need for education and skills to improve one’s
lot (Murray’s “achievement” and “acquisition”).⁵⁴

52 R. Harrison, “Studies in the Use and Validity of the Thematic
Apperception Test with Mentally Disordered Patients. II. A Quantitative
Validity Study,” Character and Personality, Vol. 9, 1940, pp. 122-133;
III. “Validation by Blind Analysis,” ibid., pp. 134-138. Also Harrison
and Rotter, op. cit., footnote 49.

53 W. E. Henry, The Thematic Apperception Technique in the Study of
Culture-Personality Relations, Genetic Psychology Monographs, Vol. 35,
1947.
Known Groups. Satisfactory results were obtained when the stories
of diagnosed groups, of known characteristics, were analyzed in detail
to determine if significant differences existed among them. The results
showed that such do exist between the following classifications,
con[...]

Rapaport, Gill, and Schafer made a qualitative analysis of TAT [...]
They found trends in [...] purposes with groups, [...]phrenic.⁵⁶
[...] have considerable validity in the per[...] delinquents in a
juvenile court.⁵⁷ [...] and with favorable results for the [...]
non-clinical groups. Among these are [...] from among officer
candidates in the armed forces,⁵⁸ and the differentiating TAT responses
of prejudiced and non-prejudiced persons.⁵⁹

Comparisons with Autobiographies. Data obtained from autobiographies
have been compared with TAT interpretations. Findings demonstrate that
some of the pictures (about 30 percent) elicit stories that reflect
past history better than do others and that, in this respect, the most
useful pictures are those that contain characters with whom the subject
is able to identify.⁶⁰

54 These unpublished studies were made by one of the author's former
graduate students, a native and life-long resident of India.

55 E. R. Balken and J. H. Masserman, “The Language of Fantasy: III. The
Language of the Fantasies of Patients with Conversion Hysteria, Anxiety
State, and Obsessive-Compulsive Neurosis,” Journal of Psychology,
Vol. 10, 1940, pp. 75-86; H. Renaud, “Group Differences in Fantasies:
Head Injuries, Psychoneurotics, and Brain Diseases,” Journal of
Psychology, Vol. 21, 1946, pp. 327-346; A. H. Davison, “A Comparison of
the Fantasy Productions on the TAT of Sixty Hospitalized Psychoneurotic
and Psychotic Patients,” Journal of Projective Techniques, Vol. 17,
1953, pp. 20-33; C. H. Saxe, “A Quantitative Comparison of
Psychodiagnostic Formulations from the TAT and Therapeutic Contacts,”
Journal of Consulting Psychology, Vol. 14, 1950, pp. 116-127.

56 D. Rapaport, et al., Diagnostic Psychological Testing, Vol. II,
Chicago: Year Book Publishers, 1946, pp. 439 ff.

57 A. A. Hartman, An Experimental Examination of the Thematic
Apperception Technique in Clinical Diagnosis, Psychological Monographs,
No. 303, 1949.

58 H. A. Murray and M. I. Stein, “Note on the Selection of Combat
Officers,” Psychosomatic Medicine, Vol. 5, 1943, pp. 386-391.

59 E. Frenkel-Brunswick, “Dynamic and Cognitive Categorization of
Qualitative Material: I. General Problems of the TAT,” Journal of
Psychology, Vol. 25, 1948, pp. 253-260.
Comparisons with Dreams. Since psychoanalytic theory holds that
dreams are a medium for expressing fears, wishes, etc., that are
repressed and inhibited, and since the TAT provides an opportunity for
expressing and elaborating these, it was to be expected that degree of
correspondence between dream content and TAT stories would be
investigated. Though not all the themes of the TAT responses appeared
in dreams of the subjects thus studied, the very few reports available
state that the extent of similarity was great enough to give added
evidence of TAT validity.⁶¹ In this connection, the highly subjective
character of interpreting dreams and dream symbolism must be kept in
mind.

Agreement with Rorschach Findings. Several investigators have
reported sufficiently close agreement between the Rorschach and the
TAT as additional evidence of the validity of the latter.⁶² To use the
inkblot test as a criterion of validity for another test is to attribute
satisfactory validity to the Rorschach. This method must be evaluated
in the light of the discussion of Rorschach reliability and validity.

Intensive Study of Individual Cases. Morgan and Murray report
that the TAT stories of one patient indicated all the major
characteristics revealed by five months of psychoanalysis.⁶³ This
highly successful outcome is due in part to the fact that the
psychologists making both the TAT and the therapeutic studies were
psychoanalytically oriented. This fact does not minimize the findings;
it emphasizes the possibilities of the TAT when uniform analytical
systems are used with the test and with outside behavior.

Tomkins made an intensive study of one person, consisting of
seventy-five hours of psychological interview and testing. He concludes
that the results of the intensive study disclosed no material
inconsistent with his TAT stories and analysis.⁶⁴ On the whole, it was
found that the TAT and other methods supplement one another, that each
contributes something to an understanding of the personality not
revealed by the others.

60 [...]

61 S. B. Sarason, “Dreams and Thematic Apperception Test Stories,”
Journal of Abnormal and Social Psychology, Vol. 39, 1944, pp. 486-492.

62 W. E. Henry, op. cit.; and R. Harrison, “The Thematic Apperception
and Rorschach Methods of Personality Investigation in Clinical
Practice,” Journal of Psychology, Vol. 15, 1943, pp. 49-74.

63 C. D. Morgan and H. A. Murray, op. cit.

Experimental Changes. There have been few reports of experimentally
produced changes in connection with the TAT. The principle of this
method is that validity is shown to the degree that experimentally
induced changes correspond with TAT changes. In one such experiment the
need of “achievement” was selected for study with college students,
with whom four pictures were used, administered under experimentally
controlled conditions.⁶⁵ Different subjects performed under conditions
called “relaxed,” “neutral,” “failure,” and “success-failure.” The
“relaxed” state was created by telling the subjects that the test is
merely experimental; the “neutral” by urging them to do their best,
though the test is experimental; the “failure” by creating a sense of
failure on previous paper and pencil tests; the “success-failure” by
creating a sense of success and of failure on previous pencil and paper
tests.

The “relaxed” state was interpreted as being least motivating and
the “failure” state most motivating as regards the need for
“achievement.” When this hypothesis was tested, as related to scoring
categories relevant to “achievement,” significant differences under the
four experimental conditions were found (at the 5-percent level or
better) for many of the categories: e.g., increase in achievement
reports, of deprivation related to achievement, acting to achieve a
goal, projecting a goal.

It appears, then, that temporarily induced ego-involving tasks can
influence TAT records. There is a basic distinction, however, between
experimentally produced, transitory conditions and their effects upon
responses, on the one hand, and actual, more or less permanent
personality traits and content, on the other. The TAT is intended to
assess the latter, primarily. The foregoing type of experiment,
therefore, has only very limited significance as a method of estimating
this instrument’s validity. This type of experimental method can
indicate only the test’s degree of sensitivity to temporary and
artificial conditions under which a person performs.⁶⁶

64 S. S. Tomkins, “Limits of Material Obtainable in the Single Case
Study by Daily Administration of the Thematic Apperception Test,”
Psychological Bulletin, Vol. 39, 1942, p. 490.

65 D. C. McClelland, et al., “The Projective Expression of Needs: IV.
The Effect of the Need for Achievement on Thematic Apperception,”
Journal of Experimental Psychology, Vol. 39, 1949, pp. 242-255.


Evaluation of the TAT. The Thematic Apperception Test has
sufficiently commended itself, from the point of view of psychological
theory and practice, to become one of the most widely used of the
projective techniques. It is being employed primarily in clinical
studies of the maladjusted and the abnormal. At the same time, it is
being used in the study of personality traits of selected normal groups
(e.g., college students), of groups having particular attitudes (e.g.,
anti-labor), and of cultures different from our own. All these
applications of the TAT have contributed to a better understanding of
the personalities with which they have been concerned. In the main, the
principal use and value of this instrument, at present, is to provide
valuable material supplementary to other sources of information.

In responding to the thematic-apperception test-situation, the
subject is free from social tensions that often accompany the early
psychological interviews. Consequently, TAT responses provide effective
starting places for interview and treatment; they are thus time savers
and facilitators in the entire process.

The range of material in the test, permitting wide quantitative and
qualitative individual differences in expressions of wishes, fantasies,
frustrations, modes of adjustment, is one of its assets as a projective
method. At the same time this wide range has thus far proved to be a
weakness, in that procedures in administering, scoring, and
interpreting vary, depending upon the conceptual system of the user and
the purpose for which he is employing the test. Further use and
analysis should determine the most desirable standard procedure and the
most valid organization and interpretation of responses. The
achievement of these ends demands, however, a more adequate systematic
analysis of the psychological process underlying TAT reports.
Improvements in analyzing and interpreting responses should contribute
to greater validity of the test.

Validity can be improved, also, by more analyses of relationships
between details of stories in conjunction with known behaviors and
traits of the subjects who tell them; that is, a close study of the
elements within the stories that contribute to validity and of those
that do not. Concurrently, it will be necessary to analyze and describe
the kinds of situations (the organized “psychological fields”) in which
the TAT is more or less effective or ineffective. Beginnings have been
made in these directions of research. If they prove fruitful, we should
then be able to indicate not only validity in general but also validity
in particular with respect to certain specified personality patterns or
syndromes, and with respect to certain kinds of situations.

66 For a summary see G. Lindzey, “TAT: Interpretive Assumptions and
Related Empirical Evidence,” Psychological Bulletin, Vol. 49, 1952,
pp. 1-25.

It will be necessary, furthermore, to analyze TAT reports (in fact,
all projection-test responses) in the light of significant determinants of
behavior and personality development of the subjects. Such determinants
are age, sex, special occupational training, and social, economic,
and cultural status and values (broadly, one’s caste and class status).
Upon the basis of such analyses it probably will be found that formulas
and categories of scoring and interpreting will require modification
and that the test itself will need revision and extension. Indeed, it is
already becoming clear that separate thematic apperception tests are
needed for some groups, such as children and adolescents. These are
described in the next chapter.


20. 




PROJECTIVE METHODS: VARIOUS 


IT WAS inevitable that a variety of projective devices should appear 
after the Rorschach and the TAT had found such wide favor and 
usefulness among psychologists. Although many of the newer tests are 
being utilized, more or less, and are proving useful to some degree in 
a limited number of situations, at present they must be regarded as 
tentative and in their early stages of development. In this chapter, 
some of these will be described and evaluated in order to give the 
reader a more comprehensive view and appreciation of this field of 
psychological testing. Of these, all but one (word association)
appeared subsequent to the Rorschach and the TAT. In addition, we
shall describe several other projective methods that are not tests in
the usual sense but that have been employed for many years and are
still widely used and are being improved (e.g., play, story telling,
finger painting).


WORD ASSOCIATION TESTS 

This projective method has a long history in psychological
experimentation, dating from the work of Francis Galton published in
1879.¹ For many years after that, word association was experimentally
studied in psychological laboratories. With the growing interest in
psychoanalysis, after 1900, the word association method received
increasing attention as a possible clinical technique. Jung and his
colleagues, beginning about 1906, made extensive investigations of the

1 “Psychometric Experiments,” Brain, Vol. 2, 1879, pp. 149-162. 




technique for use as a quick means of detecting “complexes.”² Other
clinicians, preceding and following Jung, were likewise concerned with
word associations as a method of diagnosis.

Jung’s list of one hundred words was selected as representing the
common emotional “complexes.” The subject is told that the examiner
will speak a series of words, one at a time; after each word, the subject
is to reply as quickly as possible with the first word that comes to his
mind; there is no right response, no wrong response. The examiner
records the reply to each stimulus word, the reaction time, and any
unusual speech or behavior manifestations accompanying a given
response. Replies to stimulus words that are emotionally toned for the
subject generally have a longer reaction time and may also evoke
physiological changes (e.g., in respiration, flushing, blood pressure),
restless movements, coughing, laughing, and mild speech impediments.
Jung proceeded on the principle that whenever a stimulus word was
relevant to an emotional disturbance of the subject, an irregular
response would result. The content of the responses, their reaction time,
and attendant conditions were analyzed for the discovery of emotional
tensions, inferred from the classes of words to which noteworthy
responses were given. The inferences then serve as starting places for
further psychological interview.
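
The recording procedure just described (reply, reaction time, accompanying behavior) lends itself to a simple tabulation. The following is only a minimal sketch, assuming that a "longer reaction time" is operationalized as exceeding 1.5 times the subject's own median time; that cutoff, the function name, and the sample records are illustrative assumptions and are not Jung's criteria.

```python
from statistics import median

def flag_delayed(records, factor=1.5):
    """Return stimulus words whose reaction time exceeds factor * median.

    records: list of (stimulus_word, reply, reaction_time_in_seconds).
    The factor of 1.5 is an arbitrary illustrative threshold.
    """
    cutoff = factor * median(rt for _, _, rt in records)
    return [word for word, _, rt in records if rt > cutoff]

# Hypothetical protocol for one subject (invented data).
records = [
    ("table", "chair", 1.2),
    ("dark", "light", 1.4),
    ("sickness", "mother", 4.8),
    ("man", "woman", 1.3),
    ("anger", "quiet", 3.9),
]

print(flag_delayed(records))
```

Words flagged in this way would be no more than candidates for the further interview mentioned above; the content of the reply and the attendant behavior would still have to be judged by the examiner.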


The best known of the word-association tests in the United States is
the one devised by Kent and Rosanoff for the purpose of differentiating
between the mentally ill and the normal.³ Unlike Jung, they used
words that were not intended to indicate personal emotional problems 
but were, rather, neutral in character and were to provide diagnostic 
evidence on the basis of the proportion of common (normal) re- 


normal). Recording reaction times and 


most common, less common, 
of responses falling in each 


2 C. G. Jung, Studies in Word-Association, trans. by M. D. Eder, London:
Heinemann, 1918; also, “The Association Method,” American Journal of
Psychology, Vol. 21, 1910, pp. 219-269.
3 G. H. Kent and A. J. Rosanoff, “A Study of Association in Insanity,”
American Journal of Insanity, Vol. 67, 1910, pp. 37-96, 317-390.




entiate the normal from the abnormal. It was found, however, that 
the associations did not distinguish clearly enough between the two 
groups, although the results were at times useful as additional evidence 
in the case of a particular person. 
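
The idea of judging a protocol by its proportion of common responses can be illustrated with a short computation. This is a sketch under stated assumptions: the tiny frequency table, the treatment of a response as "common" when it appears in that table, and the function name are all hypothetical, and the published Kent-Rosanoff frequency tables are not reproduced here.

```python
def common_response_proportion(responses, common_table):
    """Fraction of a subject's replies that appear among the common replies.

    responses: dict mapping stimulus word -> the subject's reply.
    common_table: dict mapping stimulus word -> set of common replies
                  (a hypothetical stand-in for a response-frequency table).
    """
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(
        1
        for stimulus, reply in responses.items()
        if reply.lower() in common_table.get(stimulus, set())
    )
    return hits / len(responses)

# Invented example data, not taken from the published tables.
common_table = {
    "table": {"chair", "cloth"},
    "dark": {"light", "night"},
    "music": {"song", "piano"},
    "man": {"woman", "boy"},
}
responses = {"table": "chair", "dark": "light", "music": "purple", "man": "Woman"}

print(common_response_proportion(responses, common_table))
```

As the text notes, such a proportion did not separate the two groups clearly enough, so a figure of this kind would serve at most as additional evidence in an individual case.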

The Kent-Rosanoff word list is given below. 


 1. table         41. high
 2. dark          42. working
 3. music         43. sour
 4. sickness      44. earth
 5. man           45. trouble
 6. deep          46. soldier
 7. soft          47. cabbage
 8. eating        48. hard
 9. mountain      49. eagle
10. house         50. stomach
11. black         51. stem
12. mutton        52. lamp
13. comfort       53. dream
14. hand          54. yellow
15. short         55. bread
16. fruit         56. justice
17. butterfly     57. boy
18. smooth        58. light
19. command       59. health
20. chair         60. Bible
21. sweet         61. memory
22. whistle       62. sheep
23. woman         63. bath
24. cold          64. cottage
25. slow          65. swift
26. wish          66. blue
27. river         67. hungry
28. white         68. priest
29. beautiful     69. ocean
30. window        70. head
31. rough         71. stove
32. citizen       72. long
33. foot          73. religion
34. spider        74. whiskey
35. needle        75. child
36. red           76. bitter
37. sleep         77. hammer
38. anger         78. thirsty
39. carpet        79. city
40. girl          80. square
81. butter        91. moon
82. doctor        92. scissors
83. loud          93. quiet
84. thief         94. green
85. lion          95. salt
86. joy           96. street
87. bed           97. king
88. heavy         98. cheese
89. tobacco       99. blossom
90. baby         100. afraid

A number of other word lists, having more or less in common with
the foregoing, are available. The most recent of these, representing
a revival of clinical interest in this technique, is that provided by
Rapaport, Gill, and Schafer.⁴ They intend their list as an aid in
clinical diagnosis, and for estimating the degree of maladjustment and
impairment of thought organization. Their list is heavily loaded with
stimulus-words presumably of analytical significance, especially in
respect to psychosexual matters. While their statistical evidence is quite
tentative, the work and interpretations of these investigators have the
merit of being based upon an analysis of what they believe to be the
psychological processes involved in a word association test. Their
psychological rationale rests upon three aspects of associative response,
which is considered to be a thought process: (1) its memory aspect,
chiefly in terms of the emotional factors that affect the process of
associative recall; (2) concept formation, in terms of relevance of
response to stimulus word; (3) “anticipation” aspect, in terms of the
common or popular character of the conceptual responses, based
presumably upon the ability of the subject to take a set or attitude from
the examiner’s instructions and from the character of the test words,
and to produce a sensible response.

A recent modification of the word association technique is the
homographic free association test. A homograph is a word spelled
exactly like another but with a different meaning and a different
derivation. An example is the word “base” which means “foundation” and


ning “wicked.” Thurstone’s homo- 


4 Op. cit., Vol. 2, Chapter 2. 




physical one.⁵ The word “revolution” is one of such a list. The subject 
is asked to respond to each word with a synonym or a short phrase. A 
person's associative response to “revolution,” for example, can have 
social or physical relevance: namely, “a political upheaval” or “the 
turn of a wheel.” The purpose of such a restricted list of words would 
be to identify persons who are the more strongly oriented toward their 
social environment. This is shown by the number of responses that are 
essentially social in nature rather than physical or literal. 

The factors influencing associations to word lists are many and 
must be taken into account in the construction and utilization of 
response-frequency tables, and in the interpretations of responses.⁶ 
These factors include not only the possibility of “complexes” and 
thought impairment, but differences in word usage resulting from 
regional, cultural, and socio-economic membership; from level of gen- 
eral intelligence; and from age level. Generally, therefore, the signifi- 
cance of any word responses must be sought in the experiences of the 
individual giving them. 

In the study of personality, both as to content and structure, it is
not likely that word association tests will have nearly the wide use and
importance of the Rorschach and the thematic apperception tests.


PICTURE TESTS 

The Travis-Johnston Projection Test.⁷ This 1949 adaptation
of the thematic apperception method is intended specifically for the
exploration of parent-child relationships. The test consists of two sets
of drawings for use with children between the ages of 4 and 15 years,
one set for girls and one for boys, each having 44 pictures. The
pictures have been devised to evaluate the ways a child manages his
“basic strivings in the face of cultural demands”; that is, his methods
of becoming socialized and his degree of success. In selecting the
situations to be portrayed, the authors assumed that “. . . certain wants
and wishes from the earliest days of life will run counter to the
socializing process set by the family constellation.” This assumption,
however, is not necessary for the use of the test or for the
interpretation of the subject’s reports.

The pictures are drawn in the form of sketches, omitting distinguishing
facial features and unnecessary environmental detail; they are
thus believed to be usable for the several cultural and socio-economic
groups of our population. They portray adults and children of both
sexes in numerous and varied situations and relationships which deal
with important and potentially and actually troublesome areas in the
socialization of children. The areas selected are sibling rivalry,
child-parent rivalry, discipline, eating, sleeping, toilet training,
cleanliness, and sexual development. This test is distinguished by the
fact that its situations are structured for age, sex, types of activity,
combinations of characters and objects, and specification of areas to be
explored. Being quite new, this instrument is in process of development;
and, as its authors state, more time is needed for its satisfactory
evolution. This test does have promise of value in supplementing
children’s case histories, disclosing dynamics of behavior, identifying
behavior areas that are particularly troublesome, and indicating the
need for and success of therapy.

Fig. 20.1. A thematic apperception test for the exploration of
parent-child relationships. From the Travis-Johnston Projection Test,
Griffin-Patterson Co. (By permission.)

5 L. L. Thurstone, Word Associations with Homonyms, University of
Chicago Psychometric Laboratory, No. 79, 1952.
6 See, for example, D. Rapaport, Emotions and Memory, Baltimore:
Williams and Wilkins, 1942; A. D. Tendler, “Significant Features of
Disturbance in Free Association,” Journal of Psychology, Vol. 20, 1945,
pp. 65-89.
7 Distributed by Griffin-Patterson Co., 544 W. Colorado Blvd., Glendale,
California.


Thompson modification of the TAT.⁸ Clinical experience with the 
Murray TAT convinced some psychologists that Negroes very often 
were unable to identify themselves with the stimulus figures of white 
persons. These subjects gave reports (stories) which were largely sim- 
ple, matter-of-fact descriptions of the persons and objects in the pic- 
tures. They did not show empathy based upon the interpersonal 
relationships being portrayed. Yet the original TAT itself is based 
upon the principle that greatest identification will occur when there is 
the greatest number of symbolic elements in the picture relevant or 
common to the subject being examined. Hence, if a picture does not 
reflect the experience and culture of the examinee, there will be little 
or no identification and the test will yield only simple descriptions or 
will become a mere verbal exercise.⁹ 


Children’s Apperception Test.¹⁰ This test, for children of 3 to 10,
consists of ten pictures in which the characters are animals rather than
human beings. The pictures show these animals in commonplace human
situations and relationships: eating, sleeping, shopping, being
punished, etc. The assumption underlying the use of pictures of animals
is that children will more readily identify with them than with
persons. In support of this, the authors cite only Sigmund Freud’s
widely known report on “The Phobia of a Five Year Old.”

The authors state in their manual: “The pictures were designed to 
elicit responses to feeding problems specifically, and oral problems 


8 Prepared by C. E. Thompson. Published by Harvard University Press,
1949.
9 For a criticism of the underlying assumption of this revision see B. F.
Riess et al., “Further Critical Evaluation of the Negro Version of the TAT,”
Journal of Projective Techniques, Vol. 15, 1951, pp. 394-400.
10 By L. Bellak and S. S. Bellak. Published by C.P.S. Co., P.O. Box 42,
Gracie Station, New York 28, N. Y. Second edition, 1950.



generally; to investigate problems of sibling rivalry; to illuminate the 
attitude toward parental figures and the way in which these figures are 
apperceived; to learn about the child’s relationship to the parents as a 
couple. . . .” The pictures are intended to elicit, also, the child’s fan- 
tasies regarding aggression and the adult world, and his methods of 
responding to and dealing with his problems of growth. 

The themes of the pictures are derived primarily from problems and 
relationships suggested by psychoanalytic principles of development 
and behavior; and interpretations of the stories rest upon symbolic 
significance attributed to the content. Although psychoanalytic prin- 
ciples guided the authors of this test in its construction, it is possible 
to use other systems in the interpretation of responses. 

While reports from clinicians indicate that this test can be helpful 
in some instances, when used by a skillful person, acceptance of the 
basic assumptions must await actual demonstration through research. 
A number of questions are yet to be answered: Do children in general 
identify more readily with animals?¹¹ What are the relationships
between responses and (1) age, and (2) socio-economic status? How
are responses influenced by intelligence level?¹² Are there significant
sex differences? 


Symonds Picture-Story Test.¹³ This set of pictures is designed for use
with adolescent boys and girls. The cards depict a large number of
situations and interpersonal relationships in which individuals of this
developmental stage commonly find themselves. The stories told in
response to the pictures are to be analyzed for the psychological forces
indicated by them. The forces are among those commonly described
in dynamic psychology:


11 For negative results, see K. R. Biersdorf and F. L. Marcuse, “Responses 
of Children to Human and to Animal Pictures,” Journal of Projective Tech- 
niques, Vol. 17, 1953, pp. 455-459. 

12 Responding to pictures is one of the test items long used in the Binet 
scales. At age 3-6, only “enumeration” is expected. The next stages are “de- 
scription” and “interpretation” (no longer included in the S-B). Since this is 
the course of development, it seems that “identification” may not be expected 
ordinarily at the first two levels, at any rate. This position was supported, at 
least tentatively: N. A. Kaake, “The Relationship between Intelligence Level 
and Responses to the Children’s Apperception Test,” Unpublished M.A. Thesis, 
Cornell University, 1951. 

13 By P. M. Symonds. Bureau of Publications, Teachers College, Columbia 
University, 1948. 




hostility and aggression          ambition and striving for success
love and erotism                  conflicts
ambivalence                       guilt
punishment                        guilt reduction
anxiety                           depression, discouragement, despair
defenses against anxiety          happiness
moral standards and conflicts     sublimation


Tentative norms indicate that the five themes most frequently obtained
deal with family relationships, aggression, economic concern,
punishment, and separation. Informal reports thus far available on this
instrument suggest that it and similar specialized tests have possibilities
of considerable usefulness; but as is the case with most of the new
devices, significant research findings have not yet been published.¹⁴


The Blacky Pictures.¹⁵ This set of eleven pictures, intended for ages
5 and over, is offered as a “measure of psychosexual development.”
Originally devised to test several Freudian hypotheses, the pictures are
now used to learn the degree to which the subject has developed the
various psychosexual traits. The pictures portray a young dog, Blacky,
in situations intended to represent relevant experiences with three
other dogs: namely, his father, his mother, and his sibling.

As in the case of the Children’s Apperception Test, the Blacky pictures
often anthropomorphize the animals by presenting them in obviously
and characteristically human situations. The same questions that
were raised regarding the former device may be asked of the latter.

Validation studies thus far published suggest “that there is something
there . . . but do not necessarily indicate what it is or where it
is.”¹⁶ Obviously, then, much more research is needed before the nature
and significance of findings can be determined.¹⁷

14 Although the Symonds pictures were drawn by a professional artist, they
seem to portray characters who predispose the subject to respond with unhappy
or “problem” stories. This writer's graduate students have characterized the
pictured individuals as “sad,” “sour,” “unhappy,” “starved,” “gaunt.”
15 Psychological Corporation, 1950.
16 G. S. Blum and H. F. Hunt, “The Validity of the Blacky Pictures,”
Psychological Bulletin, Vol. 49, 1952, pp. 238-250.
17 See also A. Ellis, “The Blacky Test Used with a Psychoanalytic Patient,”
Journal of Clinical Psychology, Vol. 9, 1953, pp. 167-172; M. L. Aronson, “A
Study of the Freudian Theory of Paranoia by Means of the Blacky Pictures,”
Journal of Projective Techniques, Vol. 17, 1953, pp. 3-19.



The Four-Picture Test.¹⁸ Although this test was completed in 1930, 
it was not made available until 1948. Eighteen years were devoted to 
experimental and clinical work before van Lennep felt it was ready for 
distribution. As a result, the manual is carefully prepared, complete, 
and explicit. 

The test consists of four vaguely drawn colored pictures (repro- 
duced from water colors) that are placed before the subject who is 
asked to write (or tell) one story that arranges and combines all four 
pictures according to his own choice. Each picture portrays a different 
situation: (1) being together with one other person (a room showing 
two persons, with a table between them. One person is standing and 
apparently gesturing); (2) being personally alone (a bedroom with a 
window behind the bed, and a very dim suggestion of a person in the 
bed); (3) being socially alone (a figure leaning against a lamp post on 
a rainy night); (4) being together with many others in a group (a 
tennis match, showing two players in the background and, in the fore- 
ground, two couples talking). 

While only four pictures are used (in marked contrast to other 
picture apperception tests), the fact that they have to be combined 
into a single story provides opportunities for a great variety of re- 
sponses. The uniqueness of this picture test, then, lies in the use of 
color, the small number of pictures, and the requirement of a single 
integrated story. 

The van Lennep test is not quantitatively scored in any respect. Its 
interpretation is entirely qualitative, being concerned with the subject’s 
“general attitude toward life.” This attitude is given in descriptive 
terms and in terms of dynamic and systematic psychological principles 
of behavior and development. 

In spite of the fact that the test manual presents no statistical data, 
van Lennep’s instrument has been favorably received—at times en- 
thusiastically—by psychologists who have used it. This reception is 
attributable to the careful and thorough psychological analysis of re- 
sponses obtained during the test’s long period of development. 


18 By D. J. van Lennep. Martinus Nijhoff, The Hague, Netherlands, 1948.
See also by van Lennep, “The Four Picture Test,” Chapter 6 in Introduction to
Projective Techniques, H. H. and G. L. Anderson (editors), New York,
Prentice-Hall, 1951; E. S. Schneidman, “Some Comparisons Among the Four
Picture Test, TAT, and Make a Picture Story Test,” Rorschach Research
Exchange and Journal of Projective Techniques, Vol. 13, 1949, pp. 150-154.



Make A Picture Story Test.¹⁹ This device, for use with adolescents
and adults, combines the essential feature of the TAT (telling stories)
with an innovation: namely, giving the subject an opportunity to
construct his own pictured situation within the limits of materials provided
by the test. These materials consist of 22 “background pictures”
(achromatic, on cardboard): some ambiguous, others semi-structured,
still others definitely structured. Among the backgrounds are, for
example:

living room      street         forest
bedroom          camp           cave
bathroom         landscape      schoolroom

There are also 67 cut-out figures (65 of which are human and 2 animal)
representing:

male adult         indeterminate as to sex        children
female adult       legendary and fictitious       a dog
minority groups    silhouette and blank faces     a snake

These figures are portrayed in a variety of postures and states.

The examiner selects one background picture at a time, asks the
subject to place one or more figures of his own choosing against the
background as they might appear in real life and then to tell a story
about the scene he has created. Any of the figures may be realistically
placed against any background; thus the principle here is that the
individual may project whatever actions and relationships he wishes upon
whatever persons he chooses.

After each story, the examiner conducts an inquiry, as in the case
of the TAT. The complete responses are recorded and analyzed
according to a rather detailed and elaborate scheme. Essentially, the
stories are analyzed primarily for form and secondarily for content.
The second of these, content, may be of almost unlimited variety,
intended to elicit through these “fantasy productions” the subject's
social adjustments and relationships. By analysis of form, the test’s
author means “. . . which figures are chosen, how many are chosen,
where they are placed on the background, how they are handled by
the subject, and what relationships they bear to each other.” Form is
analyzed and characterized in terms of forces operating on the subject


19 By E. S. Schneidman. Psychological Corporation, 1948.



(goals, drives, conflicts, values, ego ideals, etc.) and of modes of
behavior (hostility, sexuality, aspiration, autonomy, etc.).

It is thought by some psychologists that the stories evoked by means 
of this test may be more spontaneous and richer in fantasy production 
than the TAT because the pictured situation has been created by the 
subject himself. On the other hand, it is possible that the subject will 
evade certain types of situations and problems, significant in his case, 
with which he would have to deal in one way or another in the TAT 
type of test. 


The Michigan Picture Test.²⁰ This very recent instrument is the most 
systematically and thoroughly constructed thematic apperception test 
that has appeared since Murray’s TAT. It is intended for use with 
children from eight to fourteen years of age. The test consists of six- 
teen pictures, only twelve being used with each child, the selection 
depending upon the subject’s sex. The basic principle is the same as 
that of the original TAT: revealing needs through responses to pic- 
tures. Construction of the Michigan test was motivated by reports 
from many child guidance clinics that the TAT was not entirely suita- 
ble for children under fourteen years of age and that a special test, 
suitable for younger subjects, was necessary. 

The authors state in their manual that the over-all purpose of their
project was to “. . . investigate and measure the emotional reactions
of children in the preadolescent and adolescent stages of development.
It was believed the test should be non-traumatic and yet tap the
common conflict situations for this age group.” To do this, they
followed the methods of population sampling, behavior sampling, and
response analysis prescribed for and very frequently found in the
construction of tests of intelligence and specific aptitudes.

Results of the construction process indicated that certain test varia- 
bles did effectively discriminate between groups of well-adjusted and 
groups of poorly-adjusted children. These variables constitute: (1) 
the “tension index,” (2) “verb tense,” (3) “direction of forces.” The 
tension index is based upon the following four types of needs:²¹ 


love: verbal expression, positive or negative, indicating affection, 
affiliation, attachment, friendship, or admiration; 


20 By G. Andrew, S. W. Hartwell, M. L. Hutt, and R. E. Walton. Science
Research Associates, 1953.
21 Manual, pp. 66 ff.



extrapunitiveness: verbal expression of aggression toward an ex- 
ternal object; 

submission: verbal expression of defeat, resignation, passivity, 
compliance, obedience, acceptance of suffering without opposition; 

personal adequacy: verbal reference, positive or negative, of hap- 
piness, strength, competence, or any reference to the temperament 
or physical characteristics of the human or animal figures in the 
story. 


“Verb tense” of responses is scored for frequencies of past, present,
and future tenses. The hypothesis, quite empirical in origin, is that
disproportionate emphasis upon each of the several tenses may
indicate behavior as follows:


past tense:     avoidance of conflict
                a regressive trend
                schizoid character structure
                submissiveness or isolation


present tense:  compulsivity or pedantry
                personality disturbance
                effective intellectual functioning


future tense:   anxiety
                disturbed but relatively mature personality
                inefficient intellectual functioning


“Direction of forces” refers to the action expressed in the story: 


centrifugal direction (outward action)
centripetal direction (inward action)
neutral (no direction indicated)


Well-adjusted children express both centrifugal and centripetal direc- 
tions much more frequently than the poorly-adjusted. 

A second group of variables, for which the stories are also scored,
consists of those “. . . for which trends were indicated, but in which
differences were not statistically significant.” This group of variables
includes the following:

psychosexual level: a measure of psychosexual maturity in orthodox
Freudian terms;



interpersonal relationships: range and frequency of expressed in- 
terpersonal relationships; 

personal pronouns: frequencies of the three personal pronouns 
as a measure of self-reference; 

popular objects: most commonly referred to objects and persons; 

level of interpretation: degrees of interpretation, from “no re- 
sponse,” through enumeration and description to complex inter- 
pretation and inference. 


The results obtained for these five variables were inconclusive as
between well-adjusted and poorly-adjusted groups; yet they were
suggestive enough to warrant further investigation as possible
differentiating criteria.

The superiority of this Michigan test is attributable to the fact that 
the pictures were designed, in the first instance, in accordance with 
well-defined concepts of psychological development and of the most 
significant environmental situations affecting development. As already 
stated, careful sampling procedures were used. Also, unlike practically 
all other recent apperception tests, the Michigan test presents exten- 
sive data regarding significance of group differences, norms for each 
of the scoring variables, and reliabilities.²² Of the instruments
appearing since the original TAT, this test, therefore, appears to rest on the
soundest base and to be the most promising for the study and assess- 
ment of children’s personalities within the specified age groups. 
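
Reliability figures of the kind reported for this test (agreement between two judges scoring the same protocols) are conventionally expressed as product-moment correlations. The following is a minimal sketch of that computation; the ten pairs of scores are invented for illustration and are not data from the Michigan manual, and the function name is likewise an assumption.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2 cases)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one judge's scores do not vary")
    return sxy / sqrt(sxx * syy)

# Hypothetical scores assigned by two judges to the same ten protocols.
judge_a = [2, 4, 5, 7, 8, 3, 6, 5, 9, 4]
judge_b = [3, 4, 5, 6, 9, 3, 6, 4, 9, 5]

print(round(pearson_r(judge_a, judge_b), 2))
```

With groups as small as ten or fifteen cases, a coefficient computed in this way is subject to considerable sampling fluctuation, which is one reason such values are best read as tentative.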


Rosenzweig Picture-Frustration Study.²³ The full name of this test is 
“Picture-Association Study for Assessing Reactions to Frustration,” 
although it is generally referred to by the first title given above. Con- 
sisting of twenty-four cartoon-like pictures, the test is intended to serve 
as a projective method to reveal the subject's characteristic patterns of 
response to common stress-producing situations that are regarded as 
important in normal and abnormal adjustment. Two forms are avail- 


22 Reliability correlations of the four variables constituting the “tension
index” scores obtained by two judges, grade 3: .67, .93, .93, 1.00; grade 5: .91,
.97, .97, .98; grades 7 and 9: .70, .81, .98, .98. Fifteen cases in each group.
Reliability correlations for “direction of forces” as scored by two judges,
grade 3: .95; grade 5: .87; grades 7 and 9: .91. Ten cases in each group.

23 By S. Rosenzweig. The Author, 8029 Washington St., St. Louis 14, Mo.
1944-1948.



able, one for children of ages from 4 to 14 years, and one for persons 
older than fourteen.²⁴ 

Each picture shows two persons involved in a mildly frustrating
situation of common occurrence. The person at the left in each picture
is represented as making a statement which either helps describe
the frustration of the second individual or is itself actually frustrating 
the latter. The caption box above the person on the right of each pic- 
ture is blank. Purposely omitted are the facial features and expressions 
in all pictures. 

The task of the subject is to examine each picture and to write in 
the blank box the first appropriate response that occurs to him. 
(Young children, of course, will give oral responses to be written by 
the examiner.) The assumption of this test is that the subject
identifies himself, consciously or unconsciously, with the frustrated
individual in each situation and that his replies thereto are projections
of his own ways of acting.

The situations presented are classified as of two kinds: ego-blocking
and superego-blocking. The first are those “in which some obstacle,
personal or impersonal, interrupts, disappoints, deprives, or otherwise
directly frustrates the subject.” The second type of blocking
“represents some accusation, charge, or incrimination of the subject by
someone else.” As with other projective methods, an inquiry follows the
recording of responses for purposes of clarification and possible
elaboration.

Scoring of responses is based upon (1) direction of aggression and 
(2) reaction type. Under the first of these, three forms of expression 
are distinguished: (a) extrapunitiveness in which aggression is turned 
onto the environment; (b) intropunitiveness, whereby the subject 
turns the aggression upon himself; and (c) impunitiveness, in which 
aggression is evaded in an effort to gloss over the frustration. Type of 


28 S. Rosenzweig, “The Picture-Association Method and Its Application in
a Study of Reactions to Frustration,” Journal of Personality, Vol. 14,
1945, pp. 3-23; S. Rosenzweig, E. E. Fleming, and H. J. Clarke, “Revised
Scoring Manual for the Rosenzweig Picture-Frustration Study,” Journal of
Psychology, Vol. 24, 1947, pp. 165-208; H. J. Clarke, E. E. Fleming, and
S. Rosenzweig, “The Reliability of the Scoring of the Rosenzweig
Picture-Frustration Study,” Journal of Clinical Psychology, Vol. 3, 1947,
pp. 364-370; S. Rosenzweig, “Revised Norms for the Adult Form of the
Rosenzweig P-F-S,” Journal of Personality, Vol. 18, 1950, pp. 344-346.


reaction also has three classes: (a) obstacle dominance, in which the 
barrier occasioning the frustration stands out in the response; (b) ego 
defense, in which the ego of the respondent predominates; (c) need 
persistence, in which the subject emphasizes the solution of the frus- 
trating problem. 
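
The two scoring dimensions just described can be pictured as a table of three directions of aggression by three reaction types. The following is a minimal sketch in Python, assuming that a scorer has already judged each response and labelled it with one direction and one reaction type. The label spellings, the function names, and any data passed in are illustrative assumptions; they are not taken from the Rosenzweig scoring manual, and the sketch does no more than count.

```python
from collections import Counter

# Category names follow the text; the exact string spellings are illustrative.
DIRECTIONS = ("extrapunitive", "intropunitive", "impunitive")
REACTION_TYPES = ("obstacle-dominance", "ego-defense", "need-persistence")


def tally_pf_scores(scored_responses):
    """Count scored responses in each direction-by-type cell.

    `scored_responses` is an iterable of (direction, reaction_type) pairs
    that a human scorer has already assigned; this function only counts.
    """
    counts = Counter()
    for direction, reaction_type in scored_responses:
        if direction not in DIRECTIONS or reaction_type not in REACTION_TYPES:
            raise ValueError(f"unknown category: {direction!r}, {reaction_type!r}")
        counts[(direction, reaction_type)] += 1
    # Return the full 3 x 3 table, including empty cells.
    return {d: {t: counts[(d, t)] for t in REACTION_TYPES} for d in DIRECTIONS}


def direction_percentages(table):
    """Percentage of all responses falling under each direction of aggression."""
    total = sum(sum(row.values()) for row in table.values())
    if total == 0:
        return {d: 0.0 for d in table}
    return {d: 100.0 * sum(row.values()) / total for d, row in table.items()}
```

A table of this kind only summarizes the scorer's judgments; the interpretation of the pattern remains a clinical matter.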

The responses to the frustration pictures are intended to show the 
individual’s “frustration tolerance,” which signifies the relative ab- 
sence of observable disorganization in response to frustrating situa- 
tions; or, adequacy and efficiency of response despite frustrations. 
Frustrations are common experiences; modes of adjustment to them 
are regarded as significant in understanding behavior and personality 
organization, in that these modes reveal one’s techniques for coping 
with tensions. The Rosenzweig pictures are intended to provide a 
method of evaluating an individual’s tendency to blame the source of 
his frustration (extrapunitive), or himself (intropunitive), or to treat 
the situation impersonally (impunitive).

The Picture-Frustration Study has found more than a little favor 
with clinicians; and, with modifications, it has been used also in 
psychological studies of groups and group attitudes.*

Two questions, however, have been raised regarding the interpreta- 
tion to be given the responses to this test. First, are the responses that 
are given to the pictured incidents, which represent rather mild frus- 
trating situations, to be interpreted as indicative of the subject’s typical 
responses to frustrating situations in actual life that are of basic sig- 
nificance to his personality? Second, Symonds has asked whether the 
subject’s responses indicate what he actually would do, or what he 
thinks he should do, or feels he would like to do, but would not
actually do.* Although these questions have not been answered definitely,
the results obtained with this test indicate either: (1) how the sub- 
ject probably would act in these kinds of situations, or (2) what he 
knows or thinks would be expected of him in such situations. Using 


* For example, J. L. McCary, “Reactions to Frustration by Some Cultural
and Racial Groups,” Personality, Vol. 1, 1951, pp. 84-102; J. D. Holzberg
and R. Posner, “The Relationship of Extrapunitiveness on the Rosenzweig
P-F-S to Aggression in Overt Behavior and Fantasy,” American Journal of
Orthopsychiatry, Vol. 21, 1951, pp. 767-779.

2 O. K. Buros, Fourth Mental Measurements Yearbook, 1953, p. 243. Also,
M. M. Schwartz and L. Karlin, “A New Technique for Studying the Meaning
of Performance on the Rosenzweig P-F-S,” Journal of Consulting Psychology,
Vol. 18, 1954, pp. 131-134.


the responses, then, as a basis of inquiry and interview, discovery of 
the person's level of frustration tolerance is facilitated. 


The Szondi Test.* Devised by the psychiatrist Lipot Szondi, this test
consists of six sets of facial pictures of mental patients, eight in each
set. Each picture is on a separate card. Each set includes one
photograph said to portray the face of each of the following: a
homosexual, a sadistic murderer, an epileptic, an hysteric, a catatonic,
a paranoiac, a depressive, and a manic. Deri recommends that the test be
administered at least six times to the same subject. From each set, the
subject is asked to select the two photographs he most likes and the
two he most dislikes. After making his selection, the subject is asked
to tell a story about each of the four most liked and the four most
disliked, of the entire forty-eight pictures, in order to reveal specific
ways in which they affect him. The subject, of course, is not given any
clues as to the presumed type represented by each picture. The number
of choices made of each category is said to be indicative of some of the
subject's personality traits. Selection of homosexuals is interpreted as
indicating the characteristics of the subject’s psychosexual
development; selection of sadistic murderers indicates aggressiveness;
epileptics, a violent quality of behavior and thinking; hysterics, the
emotional aspect of life; catatonics, the quality of narcissism and
withdrawal; paranoiacs, expansive forces and creative abilities;
selection of manics and depressives, the person's “mood-life.”

The number and classification of choices (like or dislike) in each
category, it is claimed, indicate strength of tendencies that are latent
in the person. If few or no choices are made in a category, the
interpretation is that the tendencies are already overt in the individual.

According to the Szondi interpretation, therefore, every person has 
all the tendencies specified by him! If certain of the pictures are re- 
jected altogether, the traits they are presumed to represent are said 
to be overt. Selected but disliked pictures mean, we are told, that the
subject is repressing or sublimating the tendencies which it is claimed
they represent. Selected and liked pictures, we are told, represent


* S. K. Deri, “Description of the Szondi Test: a Projective Technique for
Psychological Diagnosis,” American Psychologist, Vol. 1, 1946, p. 239;
also, Introduction to the Szondi Test, New York: Grune and Stratton, 1949.
Rapaport, “The Szondi Test,” Bulletin of the Menninger Clinic, Vol. 5,
1941, pp. 33-39.


tendencies that are acceptable to the subject’s “super-ego” and that 
are available for expression. For the subject who submits to the 
Szondi Test, there seems to be no escape from having attributed to 
himself some degree of each of the eight traits. 
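
The bookkeeping behind the interpretation just described amounts to counting choices by category. The sketch below, in Python, assumes that each of the six sets yields two liked and two disliked pictures and that each picture is recorded by the name of its presumed category. The function names and the input layout are illustrative assumptions, not part of the Szondi materials, and the tally implies nothing about the soundness of the interpretations attached to it.

```python
from collections import Counter

# The eight picture categories named in the text.
CATEGORIES = (
    "homosexual", "sadistic murderer", "epileptic", "hysteric",
    "catatonic", "paranoiac", "depressive", "manic",
)


def tally_choices(sets):
    """Count liked and disliked choices per category.

    `sets` holds one (liked, disliked) pair per set of eight pictures;
    each element of the pair is a sequence of two category names.
    """
    likes, dislikes = Counter(), Counter()
    for liked, disliked in sets:
        if len(liked) != 2 or len(disliked) != 2:
            raise ValueError("each set calls for two likes and two dislikes")
        for name in (*liked, *disliked):
            if name not in CATEGORIES:
                raise ValueError(f"unknown category: {name!r}")
        likes.update(liked)
        dislikes.update(disliked)
    return {c: {"liked": likes[c], "disliked": dislikes[c]} for c in CATEGORIES}


def unchosen(profile):
    """Categories never chosen, which the Szondi scheme reads as overt traits."""
    return [c for c, n in profile.items() if n["liked"] + n["disliked"] == 0]
```

Under this scheme every category receives a reading whether it is chosen often, seldom, or not at all, which is the point taken up in the criticism that follows.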

This test is based upon two extremely unsound assumptions. The 
first is that photographs can in themselves represent and differentiate 
the several kinds of personalities named. This assumption sounds too 
much like the long-since rejected pseudo-science of physiognomy. 
The second premise is that a person’s overt behavior derives from 
his inherited genetic elements (genes) and that his selections are 
“instinctively” determined and represent latent forms of behavior 
which seek expression. This second assumption is contrary to the 
widely held views of the roles of heredity and experience in person- 
ality development. It is also contrary to the modern principles of 
behavior, some of which have radically modified the concept of in- 
stinctive behavior while others have rejected it entirely. 

This test is briefly described here as an example of an instrument 
that creates personality traits by fiat of its author. Thus far, a fairly 
large number of studies on the Szondi have been reported, their find- 
ings and conclusions being almost uniformly negative or their con- 
clusions very skeptical.* 


VERBAL COMPLETION TESTS 


Sentence-Completion Test.* As the name indicates, these
are tests whereby the individual is presented with a series of incom- 
plete sentences, generally open at the end, to be completed by him 
in one or more words. They are somewhat like the word-association 
test in that the word or words used to complete the sentence follow 
from and are associated with the given part of the sentence. The 
sentence-completion test, however, is regarded as superior to word- 
association because the subject may respond with more than one 


* Buros, op. cit., pp. 255 f.; H. P. David et al., “Qualitative and
Quantitative Szondi Diagnosis,” Journal of Projective Techniques, Vol. 17,
1953, pp. 75-78; M. Fleischmann, “The Discriminative Power of the Szondi
Syndromes,” Journal of Consulting Psychology, Vol. 18, 1954, pp. 89-95.

* See M. L. Hutt, “The Use of Projective Methods of Personality
Measurement in Army Medical Installations,” Journal of Clinical
Psychology, Vol. 1, 1945, pp. 134-140; A. R. Rhode, “Explorations in
Personality by the Sentence Completion Method,” Journal of Applied
Psychology, Vol. 30, 1946, pp. 169-181; J. B. Rotter and B. Willerman,
“The Incomplete Sentences Test,” Journal of Consulting Psychology,
Vol. 11, 1947, pp. 43-48.


word, a greater flexibility and variety of response is possible, and 
more areas of personality and experience may be tapped. 

The content of a particular test and the nature of the sentences 
will depend upon the group of persons and the purposes for which 
they are intended. One test may be devised to learn about satisfac- 
tions and annoyances, likes and dislikes, fears and attractions. An- 
other may be designed to reveal motives, needs, and environmental 
forces operating upon the subject. Still another may be concerned 
primarily or solely with feelings about one’s situation, as home,
community, friends, occupation, school. Others may be intended to
discover certain psychological mechanisms, such as feelings of
rejection, evidence of rationalization, methods of evasion. In other
words, each sentence-completion test should be adapted to the particular
situation in which it is to be applied. 

Several sample items are the following: 


I worry over . . .

I feel proud when . . .

Other people usually . . .

I prefer to . . .

My father used to . . .

My hope is . . .

When I was a child . . .

This projective method, at present in its early experimental stages, 
seems to have possibilities of use in the study of normal personalities
as well as with clinical cases. 

In this category, the two most fully described and most frequently 
used instruments are the Sentence Completions Test * (for ages 12
and over) and The Rotter Incomplete Sentences Blank (college
form).* Both are designed to estimate the subject’s degree and areas
of maladjustment, if any. Although both provide schemes for scoring 
responses, it appears that the greater usefulness of this type of test is 
for the indications of areas of maladjustment it may provide and for 
diagnostic clues. For instance, the responses of a ten-year-old boy to 


* By A. R. Rhode and G. Hildreth. The Psychological Corporation,
1940-1947.

* By J. B. Rotter. The Psychological Corporation, 1950. Manual in
collaboration with J. E. Rafferty. Also, J. B. Rotter et al., “The
Validity of the Rotter Incomplete Sentences Blank,” Journal of Consulting
Psychology, Vol. 18, 1954, pp. 105-111.


one set of completions were all neutral and indicative of satisfactory 
adjustment, except with regard to his school experiences. 

On the whole, it appears that sentence completion tests evoke per- 
sonality materials that are closer to the level of consciousness than 
those evoked by the Rorschach and the thematic apperception type.*
The sentence completions do, nevertheless, provide a basis for sub- 
sequent interview and counseling. 


Projective Questionnaire. By means of this technique, the subject is 
given a series of questions to answer in his own way. Unless he is 
aware of the psychological significance of the questionnaire, he does 
not grasp the implications of the questions or of his answers to them. 
It is possible, by this method, to ascertain some information regarding 
the subject’s emotional life, his values, his attitudes, and sentiments. 
Each question is intended to show which dynamisms might be operat- 
ing in an individual's behavior; for example, suppression, projection, 
identification, and others. The specific questions to be included in any 
particular device will depend upon its purpose and upon the persons 
for whom it is intended. In any event, the value of this type of test, 
at present, does not lie in a numerical score; indeed, rating schemes 
have not yet been devised, since this technique is quite recent and has 
not been used extensively. The value of the projective questionnaire 
lies, rather, in the fact that the answers are interpreted as revealing 
certain traits and serve as a basis for psychological interview. 

The following are items which were included in the questionnaire 
devised by the psychological staff of the Office of Strategic Services 
during World War II, to be used, among other tests, in the screening 
and job assignment of personnel. (See Chapter 21.) 


It seems that no matter how careful we are, we all sometimes have 
embarrassing moments. What experiences make you feel like sinking 
through the floor? 

What kinds of things do you most dislike to see people do? 

What things or situations are you most afraid of? 

If you were (are) a parent, what things would you try to guard 
your children against most carefully? 


* For example, E. Hanfmann, “Studies of the Sentence Completion Test,”
Journal of Projective Techniques, Vol. 17, 1953, pp. 280-294; J. Byers, “The
Relationships between Sub-Cultural Group Membership and Projective Test
Responses,” Unpublished Ph.D. Thesis, Cornell University, 1954.


Story Telling and Story Completion.* Although this method has been
used informally for some years,* the published work on it has not been
very extensive or definitive. Several approaches have been employed, 
the most common with children being the following: 


Retelling of children’s popular stories, such as “The Three Bears.” 

Retelling the story the child likes best of all those ever heard or 
read. 

Stories made up on specified themes: such as on a boy or girl, a 
father, a mother. 

A story especially constructed for the purpose at hand, told to the 
children by a teacher; reproduced in writing for the teacher; later
retold to the therapist, this being the “emotional version.” (In
evaluating such retold stories, the therapist must take into account the
common phenomena of memory distortions.) *


Users of these methods report that they are helpful in revealing a 
child’s conflicts, aggressions, anxieties, wish fulfillment, affectivity, etc. 
Among other methods which have been tried are these: 


A situation is set up creating a moral conflict, the effects of which 
on the child are evaluated through the story he makes up about it 
and his methods of dealing with the conflict. 

The child is asked to invent stories about favorite comic-strip
characters; and these stories are regarded as indications of his
“retreat into fantasy,” on the one hand, or his realistic attitude
toward his environment and interpersonal relations, on the other. The
child is also asked to invent stories about disliked comic-strip
characters as a means of discerning his personality tensions and for
subsequent use in therapy, since these stories give evidence of
emotional transference and are a means of communicating affectivity.


Story telling as a projective method has been tried also with adults. 


Ordinarily the subject is asked to develop a story on a theme 
dealing with some aspect, or aspects, of his environment in order to 
bring out some of his strivings and modes of adjustment. 


* E.g., H. Sargent, An Experimental Application of Projective Principles
to a Paper and Pencil Personality Test, Psychological Monographs, Vol. 57,
No. 5, 1944; B. A. Wright, “An Experimentally Created Conflict Expressed
by Means of a Projective Technique,” Journal of Social Psychology,
Vol. 21, 1945, pp. 229-245.

* For example, P. Blos, The Adolescent Personality, New York:
D. Appleton-Century, 1941.

* See F. C. Bartlett, Remembering, Cambridge: Cambridge University Press,
1932.


A little work has been done by combining music and story telling. 
Recordings of selected musical classics are played; the subject is told
to report the images and themes he associates with these. The ex- 
pectation is that the respondent's reports will indicate certain psy- 
chological states, such as fear, struggle, romance, animation, rever- 
ence, and others. 


Story-completion techniques are of several kinds. 


Dramatic situations are presented briefly and concisely; the subject 
is required to develop each into a skeleton of a short story. 

The outline of a story is provided, it being the subject's task to 
write a narrative based upon it. 

A brief incomplete story is presented, which the subject is to com- 
plete. 

The individual is given a group of brief statements each of which 
presents a situation of emotional conflict, followed by two questions 
on how the central figure in the situation responded to it: “How did 
he feel?” “What did he do and why?” 


A very recent elaboration and systematic development of this tech- 
nique of story completion has been offered by H. D. Sargent, called 
The Insight Test. It “. . . is composed of a series of items . . . in
which the bare outlines of a problem situation are stated and to which 
the subject is asked to respond by telling what the leading character 
did, and why, and how he felt about it. The nature of the material and 
the task itself, which is presented as a test of insight or ability to ‘see 
into’ the motives, actions, and feelings of others is designed to con- 
form to two basic principles of a projective test: ambiguity of stimulus 
to which the subject must respond in his own personal terms, and di- 
rection of attention away from concern with self toward the task of 
reacting to something external.” * 

This story completion test has the merit of having been provided
with a system of analysis and scoring that is intended to yield
interpretations based upon a set of common principles. The scoring
system “. . . involves primarily a differentiation between expressions
of feeling in response to the [items], and other types of expression
which are regarded as serving the purpose of control, defense, and delay
of unmodulated emotion. Phrases . . . are categorized both with
reference to the fate of the aroused affect and to the mood or content
expressed. Several indications of disturbance in thought or feeling are
also recorded. Finally, certain relationships between expressions
supposed to represent affect and defense, and within the quantities of
different kinds of expression standing for different sorts of discharge
and control, are computed.” *


36 The Insight Test, New York: Grune and Stratton, 1953, p. 17.

It is obvious that as a testing technique, story telling and story
completion present extreme difficulties in evaluation and
interpretation; for, with the exception of Sargent’s test, the
situations are almost completely unstructured and fluid, and the
possibilities of categorical analysis are almost unlimited. At present,
and perhaps permanently, these techniques must be regarded as highly
subjective in character and of use only for exploratory purposes when
viewed against a background of other information on the individual's
experiences, traits, and behavior.


DRAWING AND PAINTING * 


If volume of publication is a criterion, then it may be said
that drawing and painting (including finger painting) are of increasing
interest and significance as projective methods in the study of
personality. Studies in the psychology of art as related to
chronological age, as evidence of intelligence and specific aptitudes, as
culturally influenced, and as aesthetic judgment, have provided the
basis upon which investigations into the relationships between art and
personality have been based. The point is that not only must the
products of atypical groups be analyzed for elements and degree of
communality, but the usual, normative products of drawing and painting of
normal persons, as influenced by various factors, must be determined as
background against which to project the creations of persons whose
personalities are being studied for diagnostic or descriptive purposes.
For example, in the study of artistic products of psychotics it is
reported that, among other things, they are markedly similar to the
drawings of children and some primitives, pronounced stereotypes and
mannerisms are typical, representation is extremely non-realistic, and
their meaning [...] a manner similar to analysis of dream content. It is
reported, also, that color and movement have diagnostic significance, as
do size, line, form, and symbols used.


* P. 26.

* P. Elkisch, Children’s Drawings in a Projective Technique,
Psychological Monographs, Vol. 58, No. 1, 19[...]; W. Wolff, The
Expression of Personality: Experimental Depth Psychology, New York:
Harper, 1943; L. B. Murphy, “Art Technique in Studying Child
Personality,” Rorschach Research Exchange and Journal of Projective
Techniques, Vol. 13, 1949, pp. 320-324.

Finger painting has much in common with drawing and other 
forms of painting in personality study; but, in addition, its exponents
emphasize the considerable diagnostic value of observing and analyzing
the subject’s approach to the situation and his behavior throughout.*

Methods of obtaining productions vary among investigators and 
users of these techniques, as do their evaluations and interpretations, 
in some of which considerable free play is given by the interpreter to 
his particular theoretical biases, unsupported by scientific data. The 
great needs, before painting and drawing can come to occupy a re- 
liable position in the area of projective methods, are that their psycho- 
logical rationale shall be determined, their underlying psychological 
processes shall be analyzed, and their validity shall be established. 

A quite different projective technique utilizing drawings is the 
Draw-a-Person Test.* Used with individuals two years of age and
older, it simply requires the subject to “draw a person.” When the 
first drawing has been completed, he is instructed to draw a person of 
the opposite sex. Completion of the drawings is followed by an in- 
quiry, the subject being asked to tell a story about each “person” he 
has drawn as though he or she were a character in a play or a novel. 
The examiner may also ask a series of prescribed questions about the 
“characters” in order to obtain added information about environ- 
mental factors and relationships. 

Each drawing is analyzed regarding certain specified characteristics,
and intercomparisons of the two drawings are made in order to discern
the subject's attitudes toward himself, and toward his own as well as
the opposite sex. In general, analysis and interpretations of drawings
are based upon the hypothesis that the drawings represent one’s
conception of his body in the environment. The drawing of the figure
representing one’s sex is regarded as a “body image,” a symbol of the
concept of the self, a reflection of self-regard. In drawing the human
figure of either sex, it is hypothesized, the subject becomes
ego-involved; conflicting needs and tensions are expressed through
details and organization of the drawn figures.


* E.g., P. J. Napoli, “Interpretive Aspects of Finger Painting,” Journal
of Psychology, Vol. 23, 1947, pp. 93-132.

1 K. Machover, Personality Projection in the Drawing of the Human Figure:
A Method of Personality Investigation, Springfield, Ill.: C. C. Thomas,
1949. Also by Machover: “Human Figure Drawings of Children,” Journal of
Projective Techniques, Vol. 17, 1953, pp. 85-91.

The nature of the analysis can be better appreciated by noting the
structural features that are taken into account: size, background,
exactness, degree of completion, detailing, symmetry, mid-line emphasis,
perspective, reinforcements, proportions, placement on the page, theme,
shading, erasures. Content of the drawing is also analyzed: individual
body parts, clothing, accessories, facial expression, posture. Aesthetic
qualities are not considered.

Several body parts are listed below to illustrate some of the
interpretive significance attached to them:


head: intellectual aspirations, rational control, fantasy elaboration
of the personality;

eyes: uncertainty, paranoid wariness, sexual appeal;

nose: masculinity and assertiveness;

arms and hands: the primary extensive organs; indicators of degree of
power, degree of reaching out, extent of environmental manipulation;

bilateral symmetry: degree of obsessiveness-compulsiveness;

body points: exaggerated somatic preoccupation.


Karen Machover herself has used and experimented with this device
for a number of years and has made some acute and insightful
personality interpretations with it.* Other clinical psychologists, who
have been instructed by her, have also made effective use of the test. 
However, the validation of this test as an instrument for general use 
awaits intensive study on as objective a basis as the materials permit. 


PLAY *

Since play is free of the constraints of ordinary adult activity
and free of the restraints imposed by adults upon children, it is useful
as a projective technique in the study of less apparent aspects of
personality. For it is unstructured, provides opportunity for fantasy
and imaginativeness to operate, and gives scope to individuality of
expression.


* L. K. Frank et al., Personality Development in Adolescent Girls,
Society for Research in Child Development, Monograph No. 53, 1951,
pp. 89 ff.

* See V. M. Axline, Play Therapy, Boston: Houghton Mifflin, 1947; G. R.
Bach, Young Children’s Play Fantasies, Psychological Monographs, Vol. 59,
No. 2, 1945; E. H. Erikson, Studies in the Interpretation of Play:
Clinical Observations of Play Disruption in Young Children, Genetic
Psychology Monographs, Vol. 22, 1940.

Play, as a method of personality diagnosis and therapy used almost
solely with children, was first tried as a substitute for the
free-association technique of psychoanalysis. From such use,
investigators have emerged with the theory that play activity is
determined by a complex of factors; that it provides an outlet for
release of emotional tensions and for overt or symbolic behavior
expressing needs, wishes, desire for experience, attitudes, without fear
of censure or punishment.


FIG. 20.2. Finger Painting. “Sometimes it bleeds! Look!” (Indicates
paper painted red.) “Look! Bloody! Like my throat!”

“He smears his hands and arms through the finger paint. And as he pins
his thoughts and feelings down on paper he feels, perhaps, more secure.
After he has captured them on paper he can handle them a little better.”
From V. M. Axline, Play Therapy, Boston: Houghton Mifflin. (By
permission.)

The commonly used play technique of personality study and therapy
consists of this: the child is introduced to a collection of toys which
he is permitted to use freely, while the observer notes his activities
with respect to the particular items employed, the use made of them, the
organization or patterning of toys, attitudes toward each toy,
vocalizations, and general behavior in the play situation. The toys
themselves must be such as to provide insights into a child's
personality and needs. The items include dolls representing members of
the family, furniture (especially kitchen and bathroom appliances),
water, sand, vehicles, animals, building blocks, balloons, sticks, or any
other pieces that might be relevant in a particular instance.


FIG. 20.3. Doll Play. She began to play with the family of dolls.
“This is Tom,” she said to the therapist. “Tom is a funny boy. Want to
know what happens to Tom?” From V. M. Axline, Play Therapy, Boston:
Houghton Mifflin. (By permission.)


With play, as with drawing and painting, methods of analysis and
interpretation of behavior differ according to the theoretical
framework and assumptions of the investigator or therapist. However,
while the play technique of personality study is at present a highly
subjective method, numerous case reports attest to its value as a
diagnostic and therapeutic procedure.

An effort has been made to standardize and objectify the evaluation
of children’s play in a therapeutic situation.* This device, called the
World-test, utilizes numerous miniature pieces that represent a large
variety of objects and persons found in children’s environments. The
child is simply instructed to create any setting or situation he might
choose. The child’s performance and emotional condition are interpreted
in terms of the number and variety of pieces used, the structure of the
situation (rigid, flexible organization, chaotic), aggressiveness, etc.
(depending upon which pieces are used and how they are used). Although
this technique is more formalized and has some objective aspects, the
basic value of the method depends upon the interpretation of the child’s
performance and upon demonstrable relationships between this play
activity and children’s problems of adjustment.


EVALUATION OF PROJECTIVE TESTS 


Since the Rorschach and the Murray TAT are the two most 
widely used projective tests and have been subjected to the most ex- 
tensive and thorough research, they have been described and ex- 
plained in some detail. The other projective tests, described and 
briefly commented on in this chapter, were presented for these rea- 
sons: to acquaint the student more fully with psychological thinking 
in this area, to show how this thinking is now being implemented, and 
to indicate the variety of instruments. 

The following paragraphs summarize the present status and the 
major problems of current projective tests. 


The Rorschach and the TAT have proved to be of considerable 
value, even though much scientific research (clinical and experi- 
mental) remains to be done to fulfill all requirements of a soundly 
standardized test. It should be recalled, however, that many spe- 
cialists maintain that the usual specifications of standardization can- 
not and should not be applied to projective tests. 


*C. Buhler et al., “World-test Standardization Studies,” Journal of Child 
Psychiatry, Vol. 2, 1951, pp. 2-81. The original work was published by 
M. Lowenfeld, in England; “The World Pictures of Children,” British Journa 
of Medical Psychology, Vol. 18, 1939, pp. 65-101. 




Adequate normative data are lacking for projective devices, even 
for the Rorschach and the TAT. In some instances no normative 
data whatever are available. Response norms, both quantitative and 
qualitative, should be obtained for various groups, according to age, 
sex, educational levels, and clinical classification. More emphasis 
upon “normal” populations is needed to provide the proper perspec- 
tive within which to view performances of clinical groups. The views 
and interpretations of clinicians have been too strongly affected by 
their almost exclusive contacts with atypical personalities. 

Reliability in terms of agreement among several scorers (some- 
times called “scorer reliability”) is most significant for these instru- 
ments and on the whole has yielded favorable results. Test-retest 
reliabilities are of limited value and present special difficulties, as 
pointed out. The method of split-half reliability is inappropriate for 
reasons indicated. 

Validity studies of the matching type have proved most satisfac- 
tory. The use of known groups has been only moderately satisfactory 
because the criteria have been qualitatively and, to some degree, 
differently defined. Furthermore, the application of these definitions, 
for purposes of classification, is a subjective matter. The relative un- 
reliability of psychiatric classification has already been noted. Valid- 
ity studies are often complicated by the fact that interpretation of 
test data is “global” rather than in terms of limited and measurable 
elements. Laboratory studies of experimentally induced states or 
“sets” may be seriously questioned as a method; for they do not deal 
with basic and established personality traits. A possible exception is
the use of hypnosis and similar means. 

Although instructions for and procedures in administering a given 
test do not vary radically among specialists, uniformity is highly
desirable. In this connection, the attitude of examiners (reassuring or
neutral) is important.

Specialists will have to develop more nearly uniform and objective 
scoring systems. Since interpretation must be based upon a theory
and an understanding of the dynamics of human behavior, markedly 
different theories lead to varied interpretations and to confusion. 

Definitive experimental research remains to be done on the nature 
of the psychological processes involved in projective tests.

If rapport is established with the examiner, the subject is thereby 
encouraged to verbalize and understand his behavior, attitudes,
values, and the environmental forces.

Appropriately scaled projective techniques are of more interest to 
subjects than are personality inventories.

Malingering and falsification are more difficult than on personality 
inventories, though not impossible. 

Projective methods—particularly the Rorschach and the TAT— 
are frequently used by psychologists to supplement information
derived from other tests (e.g., intelligence) in order better to
understand the complex of factors operative in behavior in a given instance
or at a given time. 


Projective methods alone are not the answer to all questions re- 
garding human personality and adjustment—a claim made for them 
by some extreme enthusiasts. But in the hands of a qualified examiner 
these instruments assist in obtaining information not otherwise avail- 
able, except possibly through extended psychological interview and 
observation. 

It is apparent that interest in projective methods is widespread 
among psychologists, from the viewpoints of experimentalists, per- 
sonality theorists, and clinicians * dealing with individuals presenting 
every gradation and variety of problem (not necessarily maladjust- 
ment) from childhood through adulthood. It is to be expected, there- 
fore, that solutions of the problems of projective methods will, with 
time, be more nearly achieved and inadequacies of the instruments 
reduced. 

If the present trend continues, we may expect to have a number of 
separate projective tests—especially of the thematic picture type— 
designed for specific and limited age levels, providing relationships 
and situations typical of and significant for that group. The large crop 
of new devices, among them being some that are promising and in- 
sightful, will be “shaken down”; and the “chaff” will be separated 
from the “wheat.” 


* When we refer to clinicians, we mean not only the psychologist who works
with patients in a hospital or in a clinic for persons having serious behavior 
and adjustment difficulties; we mean also the psychologist who works with any 
person in an effort to deal with and solve any type of behavioral or adjustment 
situation through the use of available sound psychological techniques. In other 
words, a clinician may be one who is concerned with persons within the normal 
range, as well as one who restricts his activities to the “sick” or “near-sick.” 




SITUATIONAL TESTS 


AMONG the more recent developments in psychological testing are 
the situational tests which either test the individual in action or con- 
front him with situations related to his own life, in response to which 
he gives expression to his feelings for other persons. Although situa- 
tional tests are not as unstructured as the Rorschach and the TAT, 
they are in a degree projective methods; for the subject by means of 
them reveals some of his personality traits through his preference for 
or against certain contacts with others (as in sociometric tests), and
through his spontaneous methods of dealing with life situations,
preconceived by the examiner, that confront him (as in the psychodrama
and in Office of Strategic Services tests).


SOCIOMETRIC METHODS 
Description. This method, credited to J. L. Moreno as the
innovator,¹ may be defined as a technique for revealing and evaluating
the social structure of a group through the measurement of the
frequency of acceptance or nonacceptance between the individuals who
constitute the group. It is an approach to the problem of studying
interpersonal relationships. This technique permits the analysis of each
person’s position and status within the group, with respect to a
particular criterion. (E.g.: Name the pupil in your class with whom you
would most like to sit at lunch; name your second choice. Name the
two persons in your class, in order of preference, whom you would
choose as leader on a trip.) The method also reveals the organization
of the group, as well as identifying dominant individuals, cliques,
cleavages (sex, racial, economic, etc.), and patterns of social
attraction and rejection. The reasons for the existing patterns of attraction
and repulsion can then be determined if the personality traits of each
individual are known and the values of the group as a whole
established.

1 Who Shall Survive? New York: Beacon House, 1934.

The method is a very simple one. The sociometric test requires 
that each individual in a given group choose one or more other per- 
sons in that group for a specified purpose. In a schoolroom, the pu- 
pils may be asked to name their first and second preferences next to 
whom they wish to sit, or with whom they wish to attend the movies, 
or with whom they would like to work on a project. Or they may be 
asked to name one or more individuals in the group who possess cer- 
tain specified traits; such as the opposites “talkative-silent,” “neat- 
unkempt,” etc. Sociometric tests were used in a state training school 
for girls to determine with whom each individual would prefer to 
live or work, and with whom each would not want to live or work.²
The sociometric method was adopted for use in the armed forces in 
an effort to identify individuals for specific assignments requiring, 
for example, leadership and dependability. Thus each individual is 
viewed in his social relationship to the whole group. 

It is apparent that a sociometric test may be devised for innumera- 
ble groups and situations. The guiding principles are that each one 
must be relevant to a life situation of the group, and the items or
questions must be such as to require each person in the group to make
one or more definite selections revealing certain personal affinities or
rejections, or certain values.

As an illustration, we take a class of seventeen pupils—seven girls 
and ten boys—in a school grade. They are asked to name the two pu- 
pils with whom they would prefer to sit at lunch. After the informa- 
tion is obtained, a sociogram is constructed. Of the several kinds of 
sociograms that have been suggested, the one shown in Figure 21.1 
commends itself because it is simply made and easy to interpret. It 
is known as the “target technique,” having been described by
Northway.³ There are four concentric circles; acceptability scores, based on
total number of choices received by each person, are divided into
four groups; the lowest quarter is on the outside of the target and the
highest are in the center. Each individual is represented on the target
according to his acceptability score. The arrows show which
individuals have been selected by whom. Solid and broken lines indicate
first and second choices, respectively. It is possible, also, to divide
the target in various ways in order to show cleavages. In this figure,
the vertical line readily shows the intersex choices.

FIG. 21.1. Sociogram of an elementary school class.

2 H. H. Jennings, “A Sociometric Study of Emotional and Social
Expansiveness,” Chapter 30, in R. G. Barker, et al., Child Behavior and Development,
New York: McGraw-Hill, 1943.
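
The placing of individuals on the four rings of the target can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the sample scores, and the rule that tied scores share a ring are hypothetical and are not taken from Northway's description.

```python
def target_rings(scores):
    """Assign each person a ring from 1 (center) to 4 (outermost).

    `scores` maps a person to his acceptability score (choices received).
    The highest quarter goes to the center, the lowest to the outside.
    Assumption: persons with equal scores are placed on the same ring.
    """
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    n = len(ranked)
    rings = {}
    for rank, person in enumerate(ranked):
        previous = ranked[rank - 1] if rank else None
        if previous is not None and scores[previous] == scores[person]:
            rings[person] = rings[previous]      # tie: share the ring
        else:
            rings[person] = 1 + (4 * rank) // n  # quarter of the ranking
    return rings


# Hypothetical acceptability scores for a group of eight.
example_scores = {"Ann": 3, "Bea": 2, "Cal": 4, "Dov": 1,
                  "Eli": 0, "Fay": 2, "Gus": 0, "Hal": 1}
rings = target_rings(example_scores)
```

With groups whose size is not a multiple of four, as in the class of seventeen described above, the quarters are necessarily unequal, and some other tie rule may be preferred.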

Usually, an individual’s sociometric score is simply the number of
mentions he receives, or the percentage of mentions he receives from
others.
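
As a minimal sketch of this scoring, the following Python function counts the mentions each person receives. It reads "percentage" as the share of the other group members who name the person, which is an assumption; the function name and the five-pupil class are hypothetical.

```python
from collections import Counter


def sociometric_scores(choices):
    """Return {person: (mentions, percent of other members naming him)}.

    `choices` maps each chooser to the list of persons he or she named.
    """
    members = set(choices)
    for named in choices.values():
        members.update(named)
    mentions = Counter()
    for chooser, named in choices.items():
        # Count a chooser once per person and ignore self-choices.
        for person in set(named) - {chooser}:
            mentions[person] += 1
    possible = len(members) - 1  # everyone except the person himself
    return {
        m: (mentions[m], 100.0 * mentions[m] / possible if possible else 0.0)
        for m in sorted(members)
    }


# Hypothetical class: each pupil names two preferred lunch partners.
lunch = {
    "Ann": ["Bea", "Cal"],
    "Bea": ["Ann", "Cal"],
    "Cal": ["Ann", "Bea"],
    "Dov": ["Cal", "Ann"],
    "Eli": ["Cal", "Dov"],
}
scores = sociometric_scores(lunch)
```

Here Cal is named by all four classmates and Eli by none, so their scores lie at the two extremes of the group.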

3 M. L. Northway, “A Method for Depicting Social Relationships by
Sociometric Testing,” Sociometry, Vol. 3, 1940, pp. 144-150.

A variant of the sociometric method, which has been called an opinion
test, is used to obtain individuals’ opinions of one another with
regard to a number of traits. The test consists of a series of very brief
“word pictures,” each one followed by a blank space in which the 
child is to write the names of others in the group (e.g., classmates)
who may be like the “word picture.” One may include his own name 
if he believes the description suits him. Two items are used to de- 
scribe the extremes of each trait. For example: “Here is someone 
who finds it hard to sit still in class; he (or she) moves around 
in his (or her) seat or gets up and walks around.” “Here is some- 
one who can work quietly without moving around in his (or her) 
seat.” 

In the research reported by Tryon,⁴ twenty pairs of items were
used, describing the extremes of twenty traits, which follow: 


restless—quiet
talkative—silent
attention-getting—non-attention-getting
bossy—submissive
unkempt—tidy
fights—avoids fights
daring—afraid
leader—follower
active in games—sedentary
humor (regarding self)—humorless (regarding self)
friendly—unfriendly
popular—unpopular
good-looking—not good-looking
enthusiastic—listless
happy—unhappy
humor (jokes)—humorless (jokes)
assured (with adults)—shy (with adults)
assured (in class)—embarrassed (in class)
grown-up—childish
older friends (preference for)—younger friends (preference for)


An individual's score on a given trait is determined by the number 
of times he is mentioned by his classmates on the pair of opposed 
items. The item in each pair designating activity is given a positive 
score, whereas the opposed item—inactivity—is given a negative 
value. An individual’s score on each trait is the algebraic sum of 
“positive” and “negative” mentions received. The algebraic sum of 
mentions received by each child on each trait is converted into a
proportion of the class voting for him in order to equate for the size of
the group. (Self-mentions are not included.)

4 C. M. Tryon, Evaluation of Adolescent Personality by Adolescents, Society
for Research in Child Development Monograph, Vol. 4, no. 41, 1939.
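
The scoring rule just stated may be illustrated by a short Python sketch. The function and its arguments are hypothetical, and the use of the class size less one as the divisor is an assumed reading of "a proportion of the class voting for him."

```python
def trait_score(subject, positive_votes, negative_votes, class_size):
    """Algebraic sum of mentions, as a proportion of possible voters.

    `positive_votes` lists the classmates who named `subject` on the
    item scored positive (e.g., "restless"); `negative_votes` lists
    those who named him on the opposed item (e.g., "quiet").
    """
    positive = len(set(positive_votes) - {subject})  # self-mentions dropped
    negative = len(set(negative_votes) - {subject})
    voters = class_size - 1  # assumption: all classmates except the subject
    if voters <= 0:
        raise ValueError("class_size must be at least 2")
    return (positive - negative) / voters


# Hypothetical class of eleven: three classmates call Ann restless
# (her own vote is ignored) and one calls her quiet.
score = trait_score("Ann", ["Bea", "Cal", "Dov", "Ann"], ["Eli"], class_size=11)
```

The result, +.20, is comparable with scores from classes of other sizes, which is the purpose of the conversion.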


Validity and Reliability. The ordinary questions have arisen in
regard to reliability and validity of these sociometric devices.
Correlational data indicate that mutual ratings and reratings over short
intervals of several weeks are highly reliable, yielding coefficients of
about .90.⁵ Tryon obtained on each trait for each subject two sets of
scores by taking the score for one half of the judges (selected at ran- 
dom) and correlating these with the scores of the other half. This 
split-half method, for the twenty traits, yielded an average reliability 
coefficient of about .75. Using the test-retest method (ten-day in- 
terval), the average correlation between the two sets of scores was 
about .80. Tryon also reports that only about ten percent of the 
children and adolescents in her group received disagreeing votes on 
any pair of items; and most of these individuals were close to the 
average category in the number of votes received. Regarding individ- 
uals who clearly deviated from the average category, however, there 
was rarely any disagreement. The foregoing evidence indicates that, 
among the subjects tested, there is a high degree of consistency or con- 
currence of opinion regarding one another. 
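
The split-half procedure applied to the judges, as described above, can be sketched as follows. This is an illustration only: the data layout, the function names, and the fixed random seed are assumptions, and the correlation is an ordinary product-moment coefficient.

```python
import math
import random


def pearson(x, y):
    """Product-moment correlation (fails if either list has no variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_by_judges(votes, seed=0):
    """Correlate subjects' scores from two random halves of the judges.

    votes[judge][subject] is +1, -1, or 0 on a single trait.
    """
    judges = sorted(votes)
    random.Random(seed).shuffle(judges)  # random assignment to halves
    half = len(judges) // 2
    first, second = judges[:half], judges[half:]
    subjects = sorted({s for j in votes for s in votes[j]})
    a = [sum(votes[j].get(s, 0) for j in first) for s in subjects]
    b = [sum(votes[j].get(s, 0) for j in second) for s in subjects]
    return pearson(a, b)


# Hypothetical case of perfect agreement among four judges.
same = {"S1": 1, "S2": -1, "S3": 0}
r = split_half_by_judges({"J1": same, "J2": same, "J3": same, "J4": same})
```

When the judges agree completely, as in this artificial case, the coefficient is 1.00; disagreement between the halves lowers it.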

The usual criteria and standards of validity do not apply to
sociometric tests; for they do not set out to determine what are some of
the actual personality and behavior traits of the individuals being 
rated. They are, rather, measures of the environment of opinion in 
which each individual is functioning. When children and adolescents 
express preferences for or rejection of classmates, or when they
mention classmates in connection with specific traits, they are not
necessarily giving their own independent judgments. As a member of the
group, each individual acquires, in some degree, the prevailing group 
attitudes toward his fellows. And as the object of these attitudes, 
each individual interacts in some manner with these opinions and 
the persons holding them. Since the purpose of the sociometric
technique is not to measure the personality of each individual but to
measure the environment of opinion in which he lives, it is gratuitous
to ask for or expect evidence of validity in the usual psychometric 
terms. The discovery of the environment of opinion in which an
individual lives is none the less important; for such information helps
explain that person’s behavior as well as the organization of the group
and its values.

5 E.g., W. I. Newstetter, et al., Group Adjustment: A Study in Experimental
Sociology, Cleveland: School of Applied Social Sciences, Western Reserve
University, 1938.


Uses. In analyzing a group structure by sociometric means, Moreno 
and others have used the number of “isolates,” “mutuals,” “unre- 
ciprocated choices,” and “stars” as indices of group coherence. Fre- 
quencies of intersex, interrace, internation, inter-occupational-level 
choices may be used as evidence of cleavages. 
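
These indices can be counted directly from the choice lists, as the following Python sketch suggests. The definitions used are working assumptions, not Moreno's own: an "isolate" is taken to be a person who receives no choices, a "star" one who receives at least a stated number, and cleavage is expressed as the share of choices crossing a group line. All names and data are hypothetical.

```python
def group_indices(choices, star_threshold):
    """Return isolates, stars, mutual pairs, and unreciprocated choices."""
    members = set(choices) | {p for named in choices.values() for p in named}
    chooses = {m: set(choices.get(m, [])) - {m} for m in members}
    chosen_by = {m: {c for c in members if m in chooses[c]} for m in members}
    return {
        "isolates": sorted(m for m in members if not chosen_by[m]),
        "stars": sorted(m for m in members
                        if len(chosen_by[m]) >= star_threshold),
        # A mutual pair is listed once, in alphabetical order.
        "mutuals": sorted((a, b) for a in members for b in chooses[a]
                          if a < b and a in chooses[b]),
        # (chooser, chosen) where the choice is not returned.
        "unreciprocated": sorted((a, b) for a in members for b in chooses[a]
                                 if a not in chooses[b]),
    }


def cross_group_share(choices, group_of):
    """Proportion of all choices directed across a cleavage (e.g., sex)."""
    total = cross = 0
    for chooser, named in choices.items():
        for person in named:
            if person != chooser:
                total += 1
                cross += group_of[chooser] != group_of[person]
    return cross / total if total else 0.0


# Hypothetical class of five; each pupil names two lunch partners.
lunch = {
    "Ann": ["Bea", "Cal"],
    "Bea": ["Ann", "Cal"],
    "Cal": ["Ann", "Bea"],
    "Dov": ["Cal", "Ann"],
    "Eli": ["Cal", "Dov"],
}
sex = {"Ann": "F", "Bea": "F", "Cal": "M", "Dov": "M", "Eli": "M"}
indices = group_indices(lunch, star_threshold=4)
intersex = cross_group_share(lunch, sex)
```

A low proportion of intersex choices would point to a cleavage along sex lines; in this small example half of the choices cross it.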

Sociometric technique has been applied to the study of a variety 
of social situations: in classrooms, factories, camps, fraternities, and 
residential communities. These investigations have revealed the struc- 
tures of social groups and have provided the bases upon which thera- 
peutic procedures were developed or reconstruction undertaken. 


TESTS OF “SOCIAL INTELLIGENCE” AND LEADERSHIP 


In the armed forces, during World War II, a number of situational
tests were devised to aid in the selection of personnel for assignment
to tasks involving leadership and interpersonal relationships.⁶
Two illustrations will be given. 

A “Social Manipulation Inventory” describes a variety of problem 
situations for each of which the examinee is required to indicate the 
most desirable solution. For example: 


You are a supervisor of an office force of 10 people. One member 
is habitually late. You would: 


A. Make an example of him by discharging him.

B. Bawl him out in front of the whole group.

C. Call him in and try to find out the reason for the tardiness.

D. Call a meeting of the office force to explain that everyone owes
it to the company to be on time.

E. Call him in privately for a lecture on the importance of being
on time.


Statistical results for this inventory showed a substantial positive
correlation (.30 to .50) with evidence of academic intelligence, thus
suggesting that performance scores on the inventory reflect to some
extent the subject’s knowledge of the expected behavior, but not
necessarily his actual behavior in the specified situation.


6 J. P. Guilford et al., Printed Classification Tests, Report No. 5, U. S.
Government Printing Office, 1947, pp. 713 ff.


The number and variety of problem situations in any such situa- 
tional inventory will depend upon the nature of the situation and the 
general purpose for which it is being devised. 

A second illustration is the “Pilot Behavior Blank,” devised to 
provide information on types of leadership. It was based upon the 
conception that leadership may be classified as one of three kinds, as 
divided by Kurt Lewin: laissez-faire (giving help only when requested 
to do so), authoritarian (dictatorial, domineering), and democratic 
(participating in a group on a peer basis).⁷

This behavior blank was intended to reveal the kind of social in- 
teraction preferred by each subject, on the basis of which combat 
crews of similar preferences could be assembled. An effort was made 
to include in this preference blank an equal number of each of the 
three types of solutions to situations requiring leadership. The items 
were of the two-alternative kind, the examinee being required to in- 
dicate which procedure he prefers. For example: 


A. The pilot who gives many instructions. 
B. The pilot who, while he works more energetically than the rest
of the crew, doesn’t expect them to work as hard as he.

A. The pilot who lets the crew members make their own arrange- 
ments for quarters, mess, and entertainment. 

B. The pilot who talks about the crew’s good points to others. 
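
How preferences on a blank of this kind might be tallied is sketched below. The scoring key is entirely hypothetical: the source does not state which leadership type each alternative represents, and the keyed items here are invented for illustration and are not the items quoted above.

```python
from collections import Counter

TYPES = ("laissez-faire", "authoritarian", "democratic")

# Hypothetical key: item number -> leadership type of alternatives A and B.
KEY = {
    1: {"A": "authoritarian", "B": "laissez-faire"},
    2: {"A": "laissez-faire", "B": "democratic"},
    3: {"A": "democratic", "B": "authoritarian"},
}


def preference_profile(answers, key=KEY):
    """Count how often the examinee preferred each type of leadership."""
    tally = Counter({t: 0 for t in TYPES})
    for item, alternative in answers.items():
        tally[key[item][alternative]] += 1
    return dict(tally)


def dominant_type(profile):
    """Return the most preferred type, or "mixed" when there is a tie."""
    top = max(profile.values())
    winners = [t for t, count in profile.items() if count == top]
    return winners[0] if len(winners) == 1 else "mixed"


# One examinee's hypothetical choices on the three keyed items.
profile = preference_profile({1: "A", 2: "B", 3: "B"})
preferred = dominant_type(profile)
```

Examinees with the same dominant preference could then be grouped together, which is the use the blank was intended to serve in assembling crews.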


Although neither one of the foregoing situation-inventories was 
actually validated, they did yield results—particularly the second one 
—such as to suggest that they have possibilities for development 
into instruments of value in assessing and predicting behavior.⁸


PSYCHODRAMA 

Aside from a degree of mysticism and extravagant claims 
made for it, the psychodrama is a useful projective method in the 
study of personality and in psychotherapy. The technique is one
whereby an individual is required to play spontaneously an assigned
role in a specified situation. The “drama” may involve two or more
persons; it deals with a situation significant in the life of one or more
of the participants; each individual may play a role representing
either himself or some other personality with whom he is involved.⁹

7 K. Lewin et al., “Patterns of Aggressive Behavior in Experimentally
Created Social Climates,” Journal of Social Psychology, Vol. 10, 1939, pp. 271-299.

8 See J. Mathews, Research on the Development of Valid Situational Tests
of Leadership: 1. Survey of the Literature, Pittsburgh: American Institute for
Research, 1951.

The central principle of the psychodrama is spontaneity, which 
has been defined by Moreno as the ability of the subject to meet each 
new situation with adequacy, as “the most important vitalizer of 
living structure.”¹⁰ “The creative idea itself . . . is spontaneous and
the quality which pertains to the conception and materialization of
such an idea is called spontaneity. Spontaneity must always occur as
the first step towards the formation of a cultural conserve.”¹¹ In
contrast to the act of spontaneity stands the “cultural conserve,” which
is the creative idea that has become preserved and static, and hence 
repeated and stereotyped. What the psychodrama aims at, therefore, 
is to develop in the subject the capacity to play his life roles in a 
spontaneous and always creative manner which enables him ade- 
quately to meet the demands of new and evolving situations, rather 
than by employing stereotyped patterns of response. 

The psychodrama involves a director, who is the therapist or the 
one studying the personalities in the situation. On the basis of knowl- 
edge of his subjects and their problems, he creates the situation, se- 
lects the actors, assigns roles, observes and interprets the action, and 
acts as the link between actors and audience. Emphatic and active 
participation by the audience is an essential of this technique; for its 
members are individuals who are or will be in situations similar to 
those being portrayed in the act. The individual who is the subject 
in the drama (or the patient, in therapy) is called the primary ego. 
He is the one being assisted in the solution of a problem of adjust- 
ment, or in learning to live a certain role in life. The auxiliary ego is 
another actor in the drama; he is the agent who provides the assist- 
ance needed by the primary ego. The auxiliary ego does so either by 
(1) acting as the primary ego, identifying with him and representing 
him toward others or (2) by acting in the role of and representing 
another person with whom the primary ego is involved. 

The foregoing are the basic procedures. A number of modifications
and variations of the technique have been evolved. For example,
where the two persons involved in a conflict (e.g., parent and child,
husband and wife, employer and employee) are placed in a
psychodramatic situation, each may be instructed to act out his own role,
or each may be required to act the role of the other person, as he
perceives that other one in the specified situation.

9 See J. L. Moreno, Psychodrama, Vol. 1, New York: Beacon House, 1946.
10 Ibid., p. 100.
11 Ibid., p. 123.

Briefly stated, the psychological rationale of the psychodrama is 
the following. In therapy, the subject by acting, by participating, in 
the reproduction of a life situation significant to him experiences an 
emotional catharsis.¹² In the process, while he gains insight into his
own behavior, he is also learning how to meet a situation adequately
(spontaneously and creatively) through observations of himself and
through interpretations and evaluations given by the therapist (or 
director) and members of the audience. The psychodrama, thus, is 
intended to be a learning procedure which will teach the subject how 
to meet life situations in a manner adequate to each one as it arises. 
Moreno and adherents to his theory and practices report that group 
psychodrama is especially valuable in the treatment of minor
maladjustments, incipient neuroses, and less severe interpersonal conflicts.
In the more severe cases of maladjustment and behavior disorder, the
group procedure is regarded as a first step toward orientation and
personalized (or individual) psychodrama treatment.

In personality evaluation by means of psychodrama, the director 
and others are able to observe and analyze each subject’s character- 
istic ways of dealing with a situation commonly encountered. A very 
wide range of themes may be used for testing behavior, depending 
upon the nature of the persons involved: economic problems, family 
relationships, social status, school status, play activities, levels of
aspiration and self-realization, etc. The number and kinds of themes
are as varied as life itself. 

The psychodramatic technique, it has been urged, can also be used 
in vocational selection and training. The recommended method is that 
each subject be placed in one or more situations created to simulate
a particular occupation and that his behavior and attitudes as
the worker be observed. The individual may then be asked to reverse roles. For
instance, in selecting store clerks, each subject might be required to
take the role first of salesperson and then of customer. Similar
arrangements might be made for teacher-pupil, nurse-patient,
foreman-worker, waitress-customer relationships, and many others in which
interpersonal relationships are an essential part of the occupation.

12 Adherents of the psychodramatic technique maintain that the catharsis
derived from it is much more genuine than catharsis derived from the method
of psychoanalysis in which the subject merely verbalizes in a remote situation
(the therapist’s office), dissociated from other persons involved in the conflict.

Moreno has analyzed spontaneity into four characteristic forms of 
expression, which follow.¹³


(1) Spontaneity which activates cultural conserves and social 
stereotypes. In this form of expression there is nothing creative or 
original. It is repetition of stereotyped behavior, the only spontaneous 
element in it being a newness or freshness of feeling and vivacity, 
that is, its dramatic quality. 

(2) Spontaneity which creates new forms of behavior and art, and 
new patterns of environment. Individuals having this quality are rich 
in creative ideas; they analyze and break up existing stereotypes;
they endeavor to produce novel ideas, experiences, and objects. A 
highly spontaneous person will use his mental resources most effec- 
tively and will surpass intellectually more able individuals who do not 
possess the necessary spontaneity. [This view of Moreno’s is an over- 
simplification and very inadequate explanation of individual differ- 
ences in achievement.] 

(3) Spontaneity which originates but is not significant and unique 
enough to be creative. This form is an expansion on and variation of 
the cultural stereotype used as a model. 

(4) Spontaneity which is adequate or appropriate as a response to 
a new situation. A person might be dramatic, creative, or original; 
but his responses might, at the same time, be inappropriate to the 
situation at hand. Adequacy of spontaneity has been called “plastic 
adaptation skill” essential to “a rapidly growing organism in a rapidly 
changing environment.” 


To test an individual’s degree and kinds of spontaneity of behavior 
—which is said to be a fundamental property of his personality—he 
is placed in life situations where his actions are observed and ana- 
lyzed. In any psychodramatic situation, analysis of responses may be 
made within four categories: imaginal content, methods of percep- 
tion, field-involvement and organization (i.e., kinds of utilization and 
organization of what the subject has to work with), and social inter- 
action. Analyses and interpretations are very largely qualitative and 
subjective. 


13 Moreno, op. cit., p. 89 ff.




An effort, however, has been made to evaluate the roles an
individual assumes, on the basis of eight characteristics:¹⁴


(1) origin: collective or individual roles 

(2) degree of freedom of spontaneity: role-taking or role-creating 

(3) content: psychosomatic, dramatic, or social roles 

(4) quantity: adequacy or deficiency of roles, superiority or
inferiority of roles

(5) time: slow, average, fast, or “overheated” in developing a role 

(6) consistency: weak, balanced, or strong 

(7) rank: dominant or recessive 

(8) form: flexible or rigid

The kinds of roles an individual assumes, under these headings, are
interpreted as being expressions of his personality. This method is by 
no means a quantitative measure; but it may, at least, provide the 
beginnings of an orderly analysis of behavior, which is essential if 
the psychodramatic method is to be employed on a wide scale. 

As already indicated, this technique may be used for personality 
diagnosis and for purposes of psychotherapy. It may be employed 
as an individual device, in which only therapist and subject are in- 
volved, or as a group technique in which the group may vary in size 
from three persons (therapist and two subjects) to as many as the 
director-therapist believes he can work with (performers and
audience).

A variant of the psychodrama is called the sociodrama.¹⁵ The latter
differs from the former in respect to purpose and emphasis. Whereas
the psychodrama deals with interpersonal relations and adjustment
problems within the individual, the sociodrama is concerned with
group values and group structure and thinking. The sociodrama por- 
trays social phenomena and conflicts with which the audience is con- 
cerned and to which a solution is being sought. 

The psychodrama is at the present time a technique still in its
early stages, yet to be developed into a device of demonstrated
diagnostic value for wide use. In spite of the fact that it is the kind of
technique in the use of which a psychologist exercises a high degree
of personal judgment and evaluation, before it can be widely applied
and taught it must achieve a higher level of standardization in
respect to techniques of observation, rating of activity, and
interpretation of responses. Furthermore, the hypothesis regarding the value
of the psychodrama as a technique for developing spontaneous,
adequate, and adjusted personalities has not been sufficiently
demonstrated. We may expect, however, that the serious students who have
interested themselves in this method will provide evidence on this
score. Finally, one serious obstacle to the experimental development
and use of the psychodramatic group technique is its heavy
requirements of time, personnel, facilities, and equipment. These
considerations will prove to be a handicap; so we may expect further research
to come only from a few sources.

14 J. L. and F. B. Moreno, “Spontaneity Theory of Child Development,”
Sociometry, Vol. 7, 1944, pp. 89-128. For a modification and elaboration of
psychodrama methods, see J. Del Toro and P. Corneytz, “Psychodrama as
Expressive and Projective Technique,” Sociometry, Vol. 8, 1944, pp. 356-375.

15 J. L. Moreno, “The Concept of Sociodrama,” Sociometry, Vol. 6, 1943,
pp. 434-449; also Psychodrama, p. 315 ff.


OFFICE OF STRATEGIC SERVICES: ASSESSMENT TESTS ¹⁶


Description and Procedure. During the late war, a group of 
psychologists and psychiatrists were given the assignment of assessing 
the traits of men and women recruited for the O.S.S., as it has come 
to be known. The task was to devise test procedures that would reveal 
the recruits’ personalities and give reliable predictions of their future 
usefulness in this branch of military service. At the main station, the 
testing period lasted three days; at another station it lasted one day. 

The number and types of jobs in this organization were large and 
varied, including script writer, base-station operator, demolitions in- 
structor, field representative, section leader, resistance-group leader, 
saboteur, undercover agent, liaison pilot, pigeoneer, and others. Very 
little indeed was known regarding the specific qualifications essential 
for successful performance in the jobs to be filled. Nor was there 
time and opportunity to proceed with job analyses, as would be done 
in the case of an ordinary civilian situation when aptitude and per- 
sonality tests for a specific occupation are to be devised. The assess- 
ment staff decided, therefore, to use the “organismic” approach: that 
is, to evaluate each personality as a whole. This meant that some 
members of the staff would provide an over-all evaluation and de- 
scription of each individual, based upon interview; each candidate 
would be tested, observed and evaluated in respect to specific traits 


16 O.S.S. Staff, Assessment of Men, New York: Rinehart, 1948. The idea for 
the organization of the assessment work and some of the procedures used was 
gained from the British War Office Selection Board. 




of personality, intellect, and physique; and finally, all information 
for each individual would be assembled, organized, and interrelated 
to provide a complete description and evaluation of each candidate. 
On the basis of their unified conception of each individual's person- 
ality traits, the staff estimated the probable level of future perform- 
ance. For each recruit, then, a proposed assignment was determined 
upon, using as criteria the statements of the qualifications required 
for each job as formulated by each branch of the O.S.S. 

Among the variety of devices used in the assessment of the candi- 
dates were situational tests which were very similar in conception to 
those used in the psychodrama and sociodrama. A few of them are 
listed below: 


Upon arrival at the testing station, each candidate was judged ac- 
cording to the ease with which he used the fictitious name under 
which he went (candidates did not know each other's real names, 
ranks, or civilian status) and according to his physical agility in 
getting off the truck. 

The first day, during the welcoming talk, each candidate’s atti- 
tudes, postures, questions, and comments were noted. 

During the first meal, each recruit’s conversation was noted (topics 
revealing identity were prohibited), as was ease of establishing con- 
tact with others. 

Various other observations were made during the first evening, in 
the free periods, when the situations were relatively unstructured. 

Terrain test: Candidates were told that at noon of the following 
day they would be tested for ability to observe the terrain of the sta- 
tion and its buildings, and from their observations, to infer the his- 
tory of the farm. 

Crossing a brook: The task was for a group to carry a delicate 
instrument over a “raging torrent” with “sheer banks” and to return 
with some material from the other side. Available were a few boards, 
a log, a heavy rock, lengths of rope, a pulley, and a barrel with both 
ends knocked out. All members of the group were on an equal foot- 
ing, no one having been designated as leader. The actual setting was 
a shallow, quiet stream, about eight feet wide. 

Construction: Ostensibly a test of the candidate’s ability to direct 
two helpers in building with him a frame structure out of simple 
wooden materials. Actually it was a test of leadership, emotional 
stability, and frustration tolerance, for the two helpers were junior 
staff members. The job of one was to act passive, sluggish, and even 
as an obstacle. The job of the other was to be aggressive, offer poor 
suggestions, express dissatisfaction and criticism. 

Interrogation test: Candidates selected for intelligence work were 
tested for skill in eliciting information in an interview and in using 
the information gained. A military situation was simulated, such as 
interrogating an American soldier who has escaped from an enemy 
camp, in order to obtain information about the enemy. 

Stress interviews: This situation was devised to test the candidate's 
capacity to tolerate severe emotional and intellectual strain. Strain 
was created by rapid, confusing cross-questioning under disagreeable 
conditions with the aim of detecting flaws in a covering-up story 
which the candidate has been given twelve minutes to invent. 

Post-stress interview: Following the stress interview, each candi- 
date, individually, was placed in a relaxed atmosphere, in the presence 
of a staff member. Casual conversation was started by the latter in an 
effort to get the candidate to “let down his guard” and thus reveal 
information about himself and his experiences at the station that 
should have been kept to himself. 

Assigned leadership: Field problems were devised; in each the 
group would work as a team, with one member assigned the role of 
leader. This was used as a test of planning ability and skill in meeting 
emergencies. 

Improvisations: For each candidate appropriate dramatic situa- 
tions were invented, to which he had to respond realistically in a 
psychodrama. These improvisations were used to test out and clarify 
“critical hunches” regarding certain personality traits of each candi- 
date. Leadership quality, moods, attitudes, and modes of conducting 
interpersonal relationships were evaluated by this method. 


After interviewing branch chiefs to obtain their notions regarding 
the necessary traits for successful performance, and after organizing 
and combining the listed traits, seven major variables emerged; and 
to evaluate these in each candidate, the several kinds of tests were 
devised, of which those described above are samples. 


The seven general variables, considered to be basic to the needs of 
the O.S.S., were: 


(1) motivation for assignment: war morale, interest in proposed 
job 

(2) energy and initiative: activity level, zest, effort, initiative 

(3) effective intelligence: practical and efficient utilization of in- 
telligence in dealing with things, people, and ideas 

(4) emotional stability: steadiness, endurance, control over dis- 
turbing emotions, freedom from neurotic tendencies 

(5) social relations: good will, team work, freedom from dis- 
turbing prejudices and annoying traits 

(6) leadership: initiative, ability to evoke cooperation, to organ- 
ize, administer, and accept responsibility 

(7) security: ability to keep secrets, caution, discretion, ability to 
bluff and mislead 




To these seven traits, three others were subsequently added for use 
in selected jobs requiring them, namely: 

(8) physical ability: agility, daring, ruggedness, stamina 

(9) observing and reporting: observation and accurate recall of 
significant facts and their relations, evaluation and succinct reporting 
of information 

(10) propaganda skills: ability to perceive the psychological vul- 
nerability of the enemy, to devise subversive techniques, and to 
speak, write, or draw persuasively 

As each candidate was observed in the several individual and group 
situations, he was rated independently by staff members, for all the 
variables involved in a given test, on what was in fact a sixteen-point 
scale, although there were only six categories, thus: 


0  very inferior 
1  inferior 
2  low average 
3  high average 
4  superior 
5  very superior 

Since the judges found it desirable and possible to add plus or minus 
ratings to these numbers (omitting scores of 0- and 5+), they were 
actually using a sixteen-point rating scale. The team of staff members 
later met to pool their judgments and to arrive at a final combined 
rating for a given test situation. 
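
The count of sixteen can be checked by enumeration. The following is a minimal sketch, assuming only what the text states: six numbered categories, each of which may take a plus or a minus, with the lowest minus and the highest plus left unused. The names CATEGORIES and rating_scale are illustrative and do not come from the O.S.S. report.

```python
# Six base categories, numbered 0 through 5 as in the text.
CATEGORIES = {
    0: "very inferior",
    1: "inferior",
    2: "low average",
    3: "high average",
    4: "superior",
    5: "very superior",
}


def rating_scale():
    """Return every admissible rating, ordered from lowest to highest."""
    scale = []
    for number in sorted(CATEGORIES):
        for sign in ("-", "", "+"):
            # The text says the scores 0- and 5+ were omitted.
            if (number, sign) in {(0, "-"), (5, "+")}:
                continue
            scale.append(f"{number}{sign}")
    return scale


SCALE = rating_scale()
print(len(SCALE))  # 16
print(SCALE)
```

Six categories with three variants each would give eighteen points; dropping the two unused extremes leaves the sixteen the staff actually had available.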

The final step in the assessment program was a staff meeting in 
which the complete reports on each candidate were presented, evalu- 
ated, criticized, in which differences of interpretation and viewpoint 
were resolved and final disposition of each recruit decided on. “The 
purpose of the meeting was to arrive at an optimal characterization 
and evaluation of the candidate, and unless the report as presented 
was felt by the group to be completely adequate, it had to be made 
so through the cooperative effort of the group. While in the majority 
of cases the report was accepted with minor changes only, some prob- 
lematic cases were reviewed and reworked in great detail. This in- 
volved another pooling of observations and interpretations.”17 In 
effect, then, the entire procedure employed situational (and other) 
test techniques, rating scales, and the case conference (clinical) 
method. 


17 Ibid., p. 212. 




Validity and Reliability. Since the task of the staff was to devise tests 
that would reveal personality traits for the purpose of predicting suc- 
cess in future assignments, it was necessary to appraise the forecasting 
value of the procedures being used. Even in ordinary civilian situa- 
tions, where subjects are under frequent or constant observation and 
where their effectiveness in performance can be judged in terms of 
relatively concrete outcomes, the making of ratings presents some 
serious difficulties, as we have already noted in an earlier chapter. It 
was to be expected, therefore, that evaluations of the performance 
of persons accepted after O.S.S. assessment would be even more diffi- 
cult and less reliable; for these men and women were not always under 
close observation in the field; it was not always possible to rate 
their work, because often the results were intangible and deferred; 
and, for the most part, the primary judges on the job were inexperi- 
enced in making psychological evaluations. The following results 
should be viewed with these considerations in mind. 

Four techniques of appraisal were used: (1) overseas staff ap- 
praisal, (2) theatre commander appraisal, (3) reassignment area ap- 
praisal, and (4) returnee appraisal.18 In the first instance, assessment 
staff members obtained, in Europe, appraisals from the immediate 
chief or commanding officer of the person under consideration, and 
from several of his associates if possible. The second method was to 
obtain ratings and a report from the immediate superior of each in- 
dividual as he returned from an overseas assignment. The third source 
of validating information was a reallocation center to which individ- 
uals were sent for reassignment after they had completed one tour 
of duty. There they were interviewed by members of the assessment 
staff in regard to their previous field experiences and rated on 
anxiety, dejection, homesickness, irritability and quarrels, alcoholism, 
psychosomatic symptoms, and strength of complaints. The fourth 
method, returnee appraisal, consisted of obtaining from O.S.S. re- 
turnees their ratings of personnel known to them in their area of 
operations. In this connection, the technique consisted of check lists, 
questioning by the interviewer, and undirected personality descrip- 
tions. 

In all instances, by whatever method obtained, the appraisal in- 
formation was evaluated on a numerical scale and correlated with 


18 Ibid., p. 397 ff. 




over-all assessment ratings and with specific trait ratings which had 
been assigned after the initial testing period at the center. Basically, 
the validity coefficients depend, among other things, upon the reli- 
ability of the ratings. In the case of “returnee appraisal” the mean 
correlation between ratings provided by informants (former associ- 
ates) was low, being approximately .35. As a measure of reliability— 
agreement of ratings in this instance—this coefficient is poor and 
indicates that there were serious differences between the judgments 
of informants regarding the same individuals being evaluated. When 
appraisal ratings obtained by each of the four methods were inter- 
correlated, the coefficients varied from .46 to .59, the mean being 
approximately .52. Members of the assessment staff, however, agreed 
much more closely when they themselves rated the field performance 
of individuals on the basis of information they had obtained from 
several sources, about each person. In this instance, the mean reli- 
ability coefficient (agreement among staff conferees) was approxi- 
mately .80. 
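
The reliability figures above are agreement coefficients. As an illustration only, the sketch below computes one common form of such a coefficient: the mean of the pairwise product-moment correlations among judges who rated the same persons. The ratings and the function names are hypothetical, and the O.S.S. report may have pooled its correlations in a different way.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_interrater_r(ratings_by_judge):
    """Average the correlations over every pair of judges."""
    pairs = list(combinations(ratings_by_judge, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Hypothetical ratings of the same five persons by three judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 2, 4, 3, 5]
judge_c = [1, 3, 2, 5, 4]

print(round(mean_interrater_r([judge_a, judge_b, judge_c]), 2))
```

A value near .35 would mean the judges often disagreed about the same person; a value near .80 would mean they ranked the persons in much the same order.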

Validity of the original assessment ratings was calculated by cor- 
relating them with appraisal ratings. The obtained coefficients were 
disappointingly low: .08 and .30 (r with reassignment area ap- 
praisal); .19 and .21 (r with returnee appraisal); .23 and .15 (r with 
theatre commander’s comments); .37 and .53 (r with overseas staff 
appraisal).19 Obviously the appraisal ratings made by staff psycholo- 
gists yielded the most satisfactory results. This is due, first, to their 
professional experience and greater competence in evaluating behav- 
ior, and, second, to the fact that as members of the staff they were 
applying the same criteria in the rating of field behavior as they and 
their colleagues had applied in their original assessment ratings. 
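
The dependence of validity upon reliability, noted earlier, can be made concrete with the classical attenuation relation, under which the correlation between two sets of ratings cannot exceed the square root of the product of their reliabilities. The sketch below is an illustration only. The function name is hypothetical, the reliability of 1.0 assumed for the assessment ratings is deliberately generous, and treating the .35 agreement among informants as the reliability of the criterion is a simplification that the O.S.S. report does not itself make.

```python
from math import sqrt


def validity_ceiling(predictor_reliability, criterion_reliability):
    """Upper bound on a validity coefficient under classical test theory."""
    return sqrt(predictor_reliability * criterion_reliability)


# Criterion reliability of .35: mean agreement among returnee informants.
# Predictor reliability of 1.0: an assumed, best-possible value.
print(round(validity_ceiling(1.0, 0.35), 2))  # 0.59
```

On these assumptions, even a perfectly reliable assessment could not have correlated much above .59 with a criterion as unstable as a single informant's rating, which places the low coefficients in some perspective.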

It is to be expected that correlations between assessment ratings of 
each of the separate traits and appraisal ratings will be lower than 
those already cited for over-all trait ratings, since the latter represent 
a composite of factors. The coefficients for the separate traits (cor- 
related against each of the methods of appraisal) varied from -.14 
to +.53; the median coefficients (for each trait correlated with the 
four methods of appraisal) ranged from .06 (social relations) to .3 
(effective intelligence). 

Statistics of greater significance than correlation coefficients, in 


19 Two coefficients were found for each form of appraisal because the sub- 
jects assessed at each of two stations were separately treated. 




judging the effectiveness of the assessment procedures, are the per- 
centages of unsatisfactory cases in the field, found among the men 
and women who had been passed as satisfactory (high or medium) 
by the assessment staff. The percentages of unsatisfactory individuals 
from the groups rated high or medium (combined), as found by 
each of the four types of appraisal were: 14.8 and 6.0 (overseas staff 
appraisal); 13.4 and 15.2 (theatre commander's comments); 11.3 
and 4.5 (reassignment-area appraisal); 16.1 and 3.5 (returnee ap- 
praisal).20 

Although no data are available with regard to individual failures 
of personnel prior to the introduction of assessment procedures, it 
is safe to assume that they were higher, probably much higher, than 
after assessment. It would be very unusual indeed if this were not the 
case; for such has been the usual experience when psychological 
methods, though certainly imperfect, have been introduced where 
none existed before. In any event, the foregoing percentages of fail- 
ure are low, on the whole, and suggest the value of continued work 
and research with situational tests. 

The authors of the assessment study suggest that the data on the 
predictive value of the assessment procedures employed were not 
more impressive because of some or all of the following reasons: 
(1) defects of the appraisal methods; (2) defects of the assessment 
methods; (3) assessment staff’s lack of familiarity with jobs to be 
filled and conditions under which personnel would work; (4) the 
shifting and unpredictable conditions under which overseas person- 
nel worked. No doubt, each of these was in an undetermined degree 
responsible for the unimpressive validity coefficients. 

Since the termination of the war in 1945, little has been done 
along the lines of the O.S.S. situational tests, as a result probably of 
the difficulties and complexities inherent in the method, as already 
indicated. One similar project, at the University of Michigan, was 
concerned with the selection of graduate students in clinical psy- 
chology.21 The subjects studied were 137 men who had already been 
accepted in a number of universities for preparation in clinical psy- 
chology, under the Veterans Administration program. Although this 
project followed the principles of the O.S.S. program, to some extent 


20 Percentages are given separately for the two classification centers. 
21 E. L. Kelly and D. W. Fiske, The Prediction of Performance in Clinical 
Psychology, Ann Arbor: University of Michigan Press, 1951. 




adapted to civilian situations, additional forms of psychological evalu- 
ation were used, such as projective tests, personality inventories, 
achievement and aptitude tests, and interest questionnaires. Also, the 
period of observation was longer than that of the O.S.S., being nine 
days as compared with three (or at times only one). 

The purpose of the Michigan project was to discover, if possible, 
the traits that contribute to or detract from competence and success 
in the practice of clinical psychology. Since there were few adequate 
criteria beyond academic ratings in graduate studies, the findings 
were inconclusive. This is not surprising, for the observed behaviors 
and the personality traits rated in the situational tests (for example, 
cooperation, group participation, expressive movements and ability 
to empathize) are not necessarily associated with academic and in- 
tellectual aspects of professional preparation. Determination of the 
actual validity and predictive significance of the situational tests used 
in the Michigan project will have to wait until the graduate students 
who were the subjects have had significant professional experience 
and can be rated by competent judges who will use clearly defined 
criteria. 


EVALUATION OF SITUATIONAL TESTS 


The several psychological techniques presented in this chapter 
are not at present of equal value, nor at comparable levels of develop- 
ment, nor applicable with equal facility. Sociometric methods are 
farthest along in development, can most readily be applied, and 
yield results that are most easily interpreted. They are valuable in 
the hands of professional psychologists and other professional groups, 
in furnishing descriptions of group structures and of individual status 
within the group. They do not provide information regarding the 
causes, or factors, that account for the structure or status. These can 
be determined only through further and close study of the individuals 
and the community involved. 

The two situation inventories reported are illustrations of the 
questionnaire technique applied to a specific problem, devised for 
the purpose of revealing attitudes and preferences in that situation. 
As such, they should be viewed and evaluated in much the same terms 
as the personality inventories already described in an earlier chapter. 

The psychodrama (and the sociodrama) are based upon the 
recognized value of psychological catharsis, but catharsis through 
activity rather than verbalization. The proponents of this technique 
also claim that it develops spontaneity of behavior, which promotes 
wholesome development and adjustment. This remains to be demon- 
strated. The psychodrama, in addition, is so devised as to provide the 
subjects with opportunities to gain insight into their conflicts and 
into the attitudes of other persons involved with them in the con- 
flict. In this respect, the psychological rationale appears to be sound. 
The test, however, of the validity of any technique of personality 
diagnosis or therapy is a pragmatic one. And here, while a number of 
case studies have been reported, we must still reserve final judgment 
regarding efficacy of the method until fuller statistics are available. 
Deficiency of adequate statistics is not peculiar to the psychodrama; 
all therapeutic methods are deficient in this regard. 

The situational tests used in the O.S.S. and elsewhere are basically 
sound, from the psychological viewpoint, in that they demand activ- 
ity of the subject in a situation which simulates the actual setting or 
task to be performed. The tests were devised to yield evaluations of 
specified personality traits. The creation of situational tests to bring 
out particular traits, their numerical rating on a scale, and the statis- 
tical study of their interrelations demonstrate that situational tests 
and psychodrama, as testing techniques, need not be entirely a mat- 
ter of subjective and intuitive judgment. 

It is apparent, of course, that the use of the situational test tech- 
nique is, at best, difficult; it often requires elaborate facilities, and a 
staff of experienced psychologists (large or small) who are able to 
diagnose and interpret behavior is essential. The complexity and dif- 
ficulty of situational testing are evident from the necessary character- 
istics of a system of personality assessment, as advocated by the 
authors of the O.S.S. report, which follow: 22 


(1) Social setting: the whole program to be conducted within a 
social matrix of staff and candidates, permitting frequent informal 
contacts and opportunities to observe typical modes of response to 
other persons. 

(2) Multiform procedures: many different techniques to be em- 
ployed: standardized tests, uncontrolled situations, performance tests, 
projective methods, and interview. 

(3) Lifelike tasks: in a lifelike environment; complicated tasks 
requiring organization of thought at a high integrative level, and 


22 Assessment of Men, p. 464. 


some of them to be performed under stress and in collaboration with 
others. 

(4) Formulations of personality: collection of sufficient data to 
permit conceptualization of the form of some of the chief compo- 
nents of the personality of each individual: the formulation to be 
used as the basis in making recommendations and predictions. 

(5) Staff conference: interpretations of the behavior of each in- 
dividual at a final meeting of staff members; ratings and recommen- 
dations to be reached by consensus. 

(6) Tabulations of assessments: formulations of personality, rat- 
ings of traits, and predictions of effectiveness to be recorded for the 
purpose of statistical treatment and precise comparisons with later 
appraisals. 

(7) Valid appraisal procedures: special attention to be devoted to 
the perfection of appraisal techniques, to determine the validity of 
each test in the assessment program and of ratings of each variable. 


It is not probable that this ideal program will be achieved even 
after a long time, or even in a very few centers of personality study. 
In the meantime, close approximations can be and have been achieved 
in some psychological clinics, where all the tests can be administered 
and conditions met, excepting the situational tests. As a substitute 
for these, psychologists and other qualified persons have made de- 
tailed observations and rated behavior of subjects, not in “lifelike” 
situations but in actual life situations: for example, children in the 
classroom or on the playground, teachers in the classroom, adoles- 
cents in clubs and in games, employees at work, man and wife during 
a discussion. Obviously, there will always be some personality diag- 
noses and predictions that will have to depend upon situational tests 
(simulated, or lifelike situations) simply because it is impossible to 
place and observe the individual in the actual situation and task. For 
this purpose and for personality study of individuals under controlled 
conditions,23 the situational test technique holds much promise. 


23 For example, a subject has unresolved needs for aggression which cause 
him to respond to situations on the basis of personalities involved rather than 
in terms of the issue at hand. This hypothesis might be specifically examined in 
a specially devised situational test of the discussion type. Or personality organi- 
zation might be probed by placing subjects in unconventional situations of 
varying degrees of stress, relatively devoid of cultural cues and barriers, so 
that culturally controlled and stereotyped behavior is much less likely to be 
manifested. 


INDEX 


Ability 

to educe relations, 144 

range of, 20 

to verbalize, 164 
Aborn, M., 138, 175 
Abstract reasoning, 330 
Abstract thinking, 61 
Abstraction, 216 
Abstractness, 64 
Accomplishment quotient, 46 
Achievement age, 380 
Achievement quotient, 46, 381 
Achievement tests, evaluation of, 399 
Adams, W. M., 3 
Adaptation, 60 
Adaptation Board, 199 
Adaptiveness to a goal, 65 
Adult intelligence quotient, 125 
Adult level, 154 
Adult mental age, 125 
Adult scale, need for an, 165 
Age criterion, 131 
Age groups, 371 
Age-scale, 147 
Aikin, W. M., 398 
Algebra, 395 
Allport A-S Reaction Study, 468 
Allport, G. W., 453, 455, 487, 495 
Alster, B., 470 
American Council on Education: Psy- 
chological Examination for Col- 
lege Freshmen, 288 

Analogies test, 276, 348, 383 
Analogous functions, 401 


Anderson, 
Anderson, 
Anderson, 
Andrews, G., 558 
Anemia, 55 
Anxiety state, 425 
Aptitude, 7, 306 
in academic subjects, 395 
in law, 351 
mechanical, 31 
in medicine, 344 
teaching, 355 
testing program, 375 
tests of, 2, 28, 306, 336 
Aptitude Classification Tests, 333 
Aptitude tests, general evaluation of, 
372 
Area sampling, 491 
Arithmetic, 89, 164, 299 
tests of, 392 
Arithmetical reasoning, 159 
Army Alpha Test, 165 
Army Beta Examination, Revised, 
165, 248 
Army General Classification Test, 294 
Arnheim, R., 514 
Aronson, M. L., 555 
Art tests, evaluation of, 343 
Arthur, G., 217 
Arthur Point Scale, 192, 205 
revised, 207 
Artificial language, 292 
Ash, P., 501 
Aspects of Personality, 470 


Assembly Test of General Mechanical 
Ability, 317 

Association, 68 

Associative processes, 52 

Attention, 54, 68 

Attention span, 216 

Attitudes and values, tests of, 484 

Auditory acuity, 310 

Axline, V. M., 571 


Babcock Test, 432 
Bach, G. R., 571 
Baker, H. J., 327 
Baldwin, A. L., 149, 151 
Balinsky, B., 183 
Balken, E. R., 542 
Baller, W. R., 357 
Barker, R. G., 56, 578 
Bartlett, F. C., 567 
Bayley, N., 237, 239 
Beck, S. J., 505, 513 
Bell Adjustment Inventory, 468 
Bell, J. E., 504 
Bellak, L. B., 536 
Bender Visual-Motor Gestalt Test, 
447 
Benjamin, J. D., 525 
Bennett, G. K., 313, 322, 329 
Bernreuter Personality Inventory, 469 
Best answer test, 276 
Biersdorf, K. R., 554 
Bijou, S. W., 412 
Billingslea, F. Y., 447 
Binet, A., 9, 69, 97, 113 
early work of, 99 
major contributions of, 111 
Binet Scale, 96 
1905 version, 101 
1908 version, 105 
1911 version, 109 
early American revisions of, 114 
Biographical data questionnaires, 482 
Birren, J. E., 183 
Bizarre responses, 54 
Blacky Pictures, 555 
Block building test, 216 
Block counting test, 247 
Block design test, 162, 164 
Blocking, 55 
Blos, P., 567 
Blum, G., 555 
Boas, F., 97 


Boyd, F., 440 
Brodman, K., 475 
Brosin, H. W., 520 
Brower, D., 52$ 
Brown, A. W., 246 
Buhler, C., 531, 574 
Burchard, K. A., 414 
Burnham, P. S., 291 
Buros, O. K., 183, 337, 386, 562, 564 
Burt, C. L., 86, 144 
Byers, J., 431, 566 


C-score, 228 

California Achievement Tests, 385 

California Test of Personality, 469 

California Tests of Mental Maturity, 
270 

Callis, R., 357 

Canter, A. H., 175 

Carl Hollow Square Scale, 214 

Carlson, R., 509 

Carp, A. L., 522 

Case of Mickey Murphy, 357 

Casuist board, 198 

Catharsis, 585 

Cattell Culture-Free Test, 256 

Cattell Developmental and Intelli- 
gence Scale, 230 

Cattell, J. McK., 97 

Cattell, P., 230, 234, 239 

Cattell, R. B., 257, 259, 495 

CAVD test, 73, 171 

Chance factors, 11 

Character disorders, 54 

Chave, E. J., 485 

Check lists, 383 

Chicago Nonverbal Examination, 246 

Children’s Apperception Test, 553 

Clark, R., 538 

Clark, W. W., 270, 385, 469 

Clarke, H. J., 561 

Classification test, 276, 347 

Clerical accuracy, 330 

Clerical aptitude, tests of, 326 

Clerical speed, 330 

Clerical tests, evaluation of, 328 

Clinical aspects of tests, 51 

Clinical diagnosis, 417 

Cofer, C. N., 410 

Cohen, J., 183 

College Entrance Examination Board: 
Scholastic Aptitude Test, 290 




Color vision, 308 
Combs, A. W., 538, 543 
Comparisons, quantitative, 348 
Completion test, 299, 347, 383 
Complex educational objectives, tests 
of, 397 
Complex mental processes, 112 
Complexity of tasks, 64 
Comprehension tests, 88, 163 
Compulsiveness, 54 
Comrey, A. L., 483 
Concept formation, 71 
tests of, 435 
Conrad, H. S., 499 
Content analysis, 414 
Content validation, 399 
Cook, W. W., 357 
Corey, S. M., 488 
Cornell-Coxe Performance Ability 
Scale, 203 
Cornell Index, 37, 475 
Corneytz, P., 587 
Correlation 
biserial, 33 
effect of range on, 202 
multiple, 34 
rank-order, 23 
simple, 31 
tetrachoric, 33 
Coxe-Orleans Prognosis Test of Teach- 
ing Ability, 355 
Crawford, A. B., 291, 353 
Crawford, D. M., 312 
Crawford, J. E., 312 
Creative abilities, 153 
Cruikshank, R. M., 313 
Cube tapping sequences, 216 
Cube test, 199 
Cultural influences, 268 
Cunningham, B. V., 245 
Custom-built tests, 401 
Cut-off score, 36, 477 
Cutts, R. A., 414 


Dangling ring behavior, 222 

Darley, J. G., 471 

David, H. P., 564 

Davidson, H. H., 507, 515 

Davis, F. B., 40 f 

Davis-Eells Test of General Intelli- 
gence, 260 

Davison, A. H., 542 




Decile rank, 44 

Decline in test score, 181 

DeHaan, H., 175 

Delinquents, 412 

Del Toro, J., 587 

Deri, S. K., 563 

Derived indexes, 380 

Derner, G. F., 138, 175 

Deterioration index, 426 

Detroit Clerical Aptitudes Examina- 
tion, 327 

Deviation IQ, 279 

Diagonal test, 199 

Differential Aptitude Tests, 17, 329 

Differential diagnosis, 68 

Difficulty, level of, 64 

Digit-span test, 51, 89, 164 

Digit-symbol test, 163, 164 

Directions test, 299 

Di ties, 53 

Diversity of materials, 112 

Doll, E. A., 461 

Doll play, 573 

Dot drawing test, 245 

Douglass, H. R., 353 

Drake Musical Memory Test, 338 

Draw-A-Person Test, 570 

Drawing, 569 

Drive, 63 

Durost, W. N., 245 

Durrell Analysis of Reading Diffi- 
culty, 388 


Ebaugh, F. G., 525 

Economy of effort, 64 

Educational achievement, tests of, 8, 
28, 377 

Educational age, 43, 380 

Educational quotient, 46, 381 

Edwards, A. L., 488 

Efficiency index, 433 

Elkisch, P., 569 

Ellis, A., 496, 498, 499, 555 

Emotional forces, 66 

Emotional states, 68 

Energy, concentration of, 66 

Engineering and Physical Science Ap- 
titude Test, 358 

Environment, 112 

Erikson, E. H., 571 

Eron, L. D., 538 

Expectancy tables, 34 




Experiments, reports of, 59 
Extrinsic factors, 405 
Eysenck, H. J., 530 


Factor analysis, 6, 72, 82 
Factors, 72 
affecting test performance, 404 
illustrations of, 88 
implications of, 94 
residual, 20 
Farnsworth, Paul R., 337 
Feature profile test, 199 
Feingold, S. N., 259 
Fels Parent Behavior Scales, 462 
Ferguson Form Boards, 211 
Ferguson, G. A., 20 
Fernald, G. M., 197 
Figure analogies test, 289 
Fine arts and professions, 336 
Finger painting, 572 
Fiske, D. W., 594 
Five-figure board, 198 
Flanagan, J. C., 86, 333, 469, 498 
Fleischmann, M., 564 
Fleming, E. E., 561 
Fleming, V. V., 131, 154 
Foreign languages, 396 
Forlano, G., 470 
Form boards, 218 
Fosberg, I. A., 520 
Foster, J. C., 227 
Foulds, G. A., 255 
Four-Picture Test, 556 
Frandsen, A. N., 417 
Frank, L. K., 56, 571 
Freedom in response, 53 
Freeman, E., 308 
French, J. W., 495 
Frenkel-Brunswick, E., 542 
Freud, Sigmund, 553 
Fromm, E. O., 520 
Fry, D. F., 323 
Functional analysis, 406 
Functional unities, 30 
Functions 
sampling of, 4 
tested, 58, 266 
tested in the Bellevue scale, 163 
tested in the Stanford-Binet, 141 


G.G.W.S. Object-Sorting Test, 439 
Galton, F., 96, 547 




Gardner, E. F., 385 
Gardner, R. W. 
Garfield, S. L., 
Garrett, H. E., 86, 147 
Gates Reading Readiness Test, 390 
Gelb-Goldstein Color Sorting Test, 
438 
General ability, 80 
tests of, 27 
General comprehension, 159 
General factor, 74, 267 
in the Binet scales, 142 
Gesell, A., 221, 224 
Gesell Developmental Schedules, 220 
Gibby, R. G., 522 
Gilbert, J. A., 97 
Gilliland, A. R., 239 
Goddard, H. H., 114 
Goldfarb, W., 526 
Goldman, R., 424 
Goldstein, K., 435 
Goldstein-Scheerer Cube Test, 436 
Goldstein-Scheerer Stick Test, 441 
Goldstein Tests, evaluation of, 442 
Goodenough Drawing Test, 259 
Goodenough, F. L., 227, 228 
Gorham, T. J., 353 
Gough, H. G., 475 
Graves Design Judgment Test, 342 
Greene, H. A., 395 
Griffin, C. H., 358 
Group factors, 85 
theory of, 77, 86 
Group scales 
for college freshmen, 288 
evaluation of, 293, 301 
uses of, 304 
Group testing, 67 
Group trends, 21 
Grove, W. R., 212 
Guilford, J. P., 6, 30, 86, 93, 458, 481,
495, 519, 582
Guilford-Zimmerman Temperament
Survey, 478
Guthrie, G. M., 196


Haggerty-Olson-Wickman Rating 
Schedules, 460 

Halpern, F., 148 

Hanfmann, E., 436, 566 

Hanfmann-Kasanin Test, 442 

Harris, A. J., 313 




Harrison, R., 538, 541, 543 

Harrison-Stroud Reading Readiness
Test, 390

Harrower, M. R., 518

Harrower-Erickson, M. R., 529

Hartman, A. A.

Hartwell, S. W., 558

Hathaway, S. R., 472, 474, 475

Havighurst, R. J., 268

Healy Picture-Completion Test, 199

Healy Puzzle A, 199

Healy, W., 197

Hearing, tests of, 307

Heimann, R. A., 532

Henmon-Nelson Tests of Mental
Ability, 171, 300

Henri, V., 101

Henry, W. E., 541, 543

Herring, J. P., 115

Hertz, M. R., 506, 509, 519, 523, 525,
528

Higginbotham, S. A., 525

High school tests, evaluation of, 396

Higher mental functions, 99

Hildreth, G., 565

Hogan, H. P., 428

Holzberg, J. D., 520, 562

Holzinger, K. J., 86

Homographic free association test,
550

Hopi Indians, 268

Hotelling, H., 86

Hoyt, C., 20

Hunt, H. F., 445, 555

Hunt, J. McV., 410

Hunt-Minnesota Test for Organic
Brain Damage, 444

Hunt, W. A., 428

Hutt, M. L., 417, 448, 522, 532, 558,
564


Impairment, patterns of, 434
Incentive, 63 
Individual differences, 99 
Information test, 54, 88, 159, 163, 276
Ingraham-Clark Diagnostic Reading 
Tests, 388 
Insight Test, 568 
Institute of Educational Research In- 
telligence Scale CAVD, 298 
Intelligence 
abstract, 70
concrete, 69 
definition of, 5, 60, 111 
kinds of, 69 
social, 69 
Intelligence quotient, 46, 111 
adjusted means of, 135 
classification of, 124, 140 
corrected CA divisor for, 134 
distribution of, 122, 136, 169 
means of, 169 
qualitative significance of, 47 
quantitative significance of, 47 
standard deviations of, 169 
of the Stanford-Binet, 133 
variability of, 137 
of the Wechsler-Bellevue, 179 
Intelligence tests as clinical instru- 
ments, 403 
Interest inventories, 361 
evaluation of, 366 
Interest level, 57 
Interests, stability of, 370 
Internal consistency of tests, 10 
Interval discrimination test, 338 
Intrinsic factors in score changes, 404 
Iowa Legal Aptitude Test, 352 
Iowa Tests of Educational Achieve- 
ment, 385 
Item administration, order of, 416 
Item analysis, 38 
Item difficulty, 39 
Items, types of, 145, 383


Jackson, R. W. B., 20 
Jastrow, J., 97
Jeffery, M., 444
Jennings, H. H., 578 
John, E., 144 

Jones, L. V., 144 
Juckem, H., 444 
Jung, C. G., 548 


554 


Kaake, N. A., 55
51 


Kandel, I. L., 3
Karlin, L., 562 

Kasanin, J., 436 

Kelley, D. M., 505 

Kelley, T. L., 41, 85, 385 
Kellogg, C. E., 248 

Kelly, E. L., 594 
Kennedy, S., 514 




Kenney, K. C., 488 

Kent, G. H., 154, 548 

Kent Series of Emergency Scales, 428 
Kent-Shakow Form Board Series, 212 
Kite, E. S., 100, 105 

Klein, G. S., 514 

Klopfer, B., 505, 507, 515 

Knauber Art Ability Test, 340 
Kraepelin, E., 98 

Krugman, J. I., 194, 521, 526
Krugman, M., 147 

Kuder, G. F., 20 

Kuder Preference Record, 361 
Kuhlmann-Anderson Tests, 285 
Kuhlmann, F., 114 

Kumin, E., 211 

Kutner, B., 538 


Language test, 267 
Language usage, 331 
La Piere, R. T., 488 
Lapp, C. J., 358 
Lasaga y Travieso, J. I., 536 
Law School Admissions Test, 352
Leadership, tests of, 582 
Learning 

ability, 60 

effects of, 25 
Leeds, C. H., 357 
Lefever, D. W., 531
Legal aptitude tests, evaluation of,
353

Lewin, K., 583 
Lewinski, R. J., 170 
Lewis, D., 337 
Likert, R., 486 
Lindzey, G., 487, 545 
Loevinger, J., 95 
Loftus, J. J., 470 
Logical selection test, 276 
Long, J. A., 40 
Lorge, I., 183, 252 
Lowenfeld, M., 574 


Machover, K., 570 
MacMurray, D. A., 200
MacPhee, H. M., 170
Madden, R., 385 

Madison, T., 338 

Magaret, A., 151, 414
Make-A-Picture Story Test, 557 
Manikin test, 199 




Manual dexterity, 311 

Manual tests, 311 

Marcuse, F. L., 554 

Mare and Foal Form Board, 198

Martinez-Arango, C., 536 

Maslow, A. H., 479 

Masserman, J. H., 542 

Matching figures, 247 

Matching test, 383 

Mathews, J., 583 

Maurer, K. M., 227, 228, 239

Mayman, M., 538

McCall, W. A., 252 

McCandless, B. R., 412 

McCary, J. L., 562 

McClelland, D. C., 544

McKinley, J. C., 472, 475 

McLeish, J., 340 

McNamara, W. J., 471 

McNemar, Q., 131, 141, 142, 148, 
149, 276, 493 

Mean scatter, 420 

Mechanical aptitude tests, evaluation 
of, 324 

Mechanical reasoning, 330 

Medical aptitude tests, evaluation of, 
350 

Medical College Admissions Test,
346, 350

Meehl, P. E., 444, 474, 475

Meier Art Judgment Test, 340

Mellone, M. A., 149

Melton, A. W., 313

Memory, 52

Memory factor, 78

Memory span for digits, 160

Mental ability, analyses of, 72

Mental age, 43, 107, 109, 112, 133,
134

Mental defectives, 51, 171, 412

Mental growth, curve of, 50

Mental impairment, tests of, 432

Merrill, M. A., 129, 140, 155

Merrill-Palmer Scale of Mental Tests,
235

Metropolitan Readiness Test, 390

Meyer, B. T., 512

Michigan Picture Test, 558

Miles, W. R., 313

Miller Analogies Test, 296

Miller, W. S., 296

Minnesota Clerical Test, 328

Minnesota Mechanical Assembly Test,

Minnesota Multiphasic Personality In-
ventory, 472

Minnesota Paper Formboard (Re-
vised), 321

Minnesota Personality Scale, 471

Minnesota Preschool Scale, 227

Minnesota Spatial Relations Test, 320

Minnesota Teacher Attitude Inven-
tory, 357

Mittelmann, B., 475

Mixed group scales of mental ability,
270

Monroe Diagnostic Reading Examina-
tion, 389

Monroe Reading Aptitude Tests, 391

Mooney Problem Check List, 479

Moore, B. V., 358

Moreno, F. B., 577, 584, 587

Morgan, C. D., 533

Morris, C. B., 202

Morton, N. W., 248

Moss, F. A., 344

Motivation, 66


oes < 509, 530, 531 
Murphy, Gardner, 453

Murphy, L. B., 569

Murray, H. A., 459, 533, 542
Mursell, J. L., 337

Musical aptitude tests, 336
evaluation of, 338


Napoli, P. J., 570 
ee Intelligence Test, 204 
National Teachers Examinations, 356
Nelson, V. L., 237
Newstetter, W. T., 581
Noll, V. H., 499
Nonlanguage Multi-mental Test, 252
ao g group scales, 241 
evaluation of, 265
Norms, 43, 56, 57
Northway, M. L., 579
er test, 267 
Number-checking test, 27
Number facility, 280


Number factor, 78 
Number series test, 289 
Numerical ability, 330 
Nutritional disturbances, 55 


Object assembly test, 162, 164, 218 

Objectivity, 2 

Obsessive-compulsive individual, 425 

Obsolete items, 152 

Oehrn, A., 98 

Office of Strategic Services: Assess- 
ment Tests, 588 

Ohio State University Psychological 
Test, 289 

Olsen, M., 353 

Operational levels, 4 

Opinion polling, 489 

Opposites test, 276, 346 

Organic damage, 54 

Organized stoci 

Originality, 80, 153 

Originals, emergence of, 66

Osgood, C. E., 498 

Otis Group Intelligence Scale, 301 

O'Toole, C. E., 324


ing by tests, 372 


Painting, 569 
Paper form board test, 247 
Paragraph comprehension test, 348 
Parkyn, G. W., 154 
Parten, M.
Pascal, G. R., 448
Pastovic, J. J., 196 
Paterson, D. G., 320 
Pattern Perception Test, 253 
Patterns of response, 52 
Pegboard test, 27, 319 
Penrose, L. S., 253 
Perceiving differences, 327 
Perceiving similarities, 327 
Percentile rank, 44 
Perceptual speed, 202 
Performance materials, 130 
Performance scales, 156, 197
evaluation of, 216 
factorial analysis of, 202 
functions tested by, 215 
Perlman, J. A., 512, 528 
Personality 
definition of, 452 
dynamics, 68 
inventories, 3, 466, 493
rating scales, 452 
structure, 515 
tests of, 452, 466, 502, 547, 577


Physical measurement, 48 

Picture arrangement test, 161, 164,
247 

Picture completion boards, 218 

Picture completion test, 161, 164, 245 

Picture sequence test, 247 

Picture tests of personality, 551 

Pilot Behavior Blank, 583 

Pintner-Cunningham Primary Test, 
245 

Pintner Nonlanguage Series: Inter-
mediate Test, 250

Pintner-Paterson Scale of Perform-
ance Tests, 198

Pintner, R., 129, 470

Piotrowski, Z. A., 520, 526

Piper, A. H., 395

Play, 571

Plier dexterity test, 312

Point-scale, 147

Population sample, 2, 58

Porteus Maze Test, 205

Posner, R., 562

Practice, effects of, 25

Pre-Engineering Ability Test, 360

Primary mental abilities, 77
tests of, 279

Probable error, 179

Proficiency tests, 37, 401

Profile chart, 284

Profiles, 369

Prognostic Test of Mechanical Abili-
ties, 323

Progressive Education Association Be-
havior Description, 462

Progressive Matrices Tests, 255

Projection, 502

Projective methods, 502, 547

Projective questionnaire, 566

Projective tests, 3
evaluation of, 575

Psychodrama, 583

Psychological analyses, 144

Psychological Corporation General
Clerical Test, 327

Psychological test, definition of, 1


Psychometric pattern, 412 
Psychotic persons, 425 
Pursuit test, 318 


Qualitative analysis, 53 
Quantitative reasoning, 292 
Questionnaire, 490 


Rabin, A. I., 418, 520 
Rafferty, J. E., 565 
Random comments during testing. 54 
Random sampling, 490 
Rapaport, D., 411, 418, 506, 542, 551, 
563 
Rating scales, 3, 452 
evaluation of, 464 
Raven, J. C., 253, 255 
Raw score, 42 
Reaction time, 311 
test of, 315 
Reading tests, 387 
diagnostic, 388 
evaluation of, 392 
Reasoning, 6, 280 
Reasoning factor, 78 
Relative rank, 42 
Reliability, 10, 58 
absolute, 12 
coefficient of, 12, 20 
relative, 12 
split-half, 13 
subtest, 25 
test-retest, 12 
Remmers, H. H., 487 
Renaud, H., 542 
Responses, qualitative aspects of, 423 
Retesting, 10 
Rhode, A. R., 564, 565 
Richards, T. W., 237 
Richardson, M. W., 20 
Riess, B. F., 553 
Riggs, M. M., 414
Rioch, M. J., 526 
Roberts, J. A. F., 149 
Roe, A., 530 
Rorschach Test, 63, 504 
evaluation of, 531 
Rosanoff, A. J., 548 
Rosenzweig Picture-Frustration Study,
560
Rosenzweig, S., 561 
Rote memory, 280 




Rothney, J. W. M., 532 

Rotter Incomplete Sentences Blank, 
565 

Rotter, J. B., 538, 564 

Rubinstein, B. B., 525 

Ruch, G. M., 385 

Russell, J. T., 374 


Saetveit, J. C., 337 
Sample, stratified, 3 
Sampling theory, 85 
Sanderson, H., 528 
Sanderson, M. H., 520 
Sandiford, P., 40 
Sanford, R. N., 539 
Sarason, S. B., 259, 543 
Sargent, H., 532, 567 
Saxe, C. H., 542 
Scale, 101, 112 
factorial analyzed type of, 149 
for infants and preschool children, 
220, 237 
global type of, 149 
spiral-omnibus type of, 243 
Scaling, methods of, 484 
Scatter analysis, 54, 409, 419 
Schafer, R., 421 
Scheerer, M., 435 
Schizophrenics, 434 
Schlesinger, H. J., 514 
Schneidman, E 
Scholastic achievement, tests of, 2 
School learning, 151 
Schrader, W. B., 353 
Schwarts, M. M., 562 
Schwartz, E. K., 553 
Science and engineering aptitude tests, 
evaluation of, 360 
Scientific and engineering aptitudes, 
tests of, 357 
Scorers, consistency of, 26 
Scott, W. D., 463 
Seashore-Bennett Stenographic Pro- 
ficiency Test, 402 
Seashore, C. E., 337 
Seashore, H. G., 189, 329 
Seashore Measures of Musical Talent, 
9, 27, 336 
Security-Insecurity Inventory, 479 
Seguin Form Board, 198 
Selection ratio, 370 
Sensory capacity, 314 




Sensory process, 68 
Sentence-completion tests, 291, 346, 
564 
Shavzin, A. R., 522 
Ship test, 199 
Shotwell, A. M., 
Siegel, M. G 
Siipola, E. M., 
Similarities test, 88, 160, 164 
Simon, T., 105 
Simple recall, 383 
Sims SCI Occupational Rating Scale, 
488 
Sims, V. M., 488 
Singer, J. L. 
Situational tests, 
evaluation of, 595 
Sloan, W., 414 
Smith, E. R., 398 
Smith, M., 374 
Snellen Chart, 308 
Social intelligence, tests of, 582 
Social Manipulation Inventory, 582 
Social value, 65 
Sociometric methods, 577 
Space factor, 78, 202 
Space relations, 330 
Spache, G., 146 
Spatial perception, 52, 280 
Spatial relations tests, 320 
Spearman-Brown formula, 14, 16 
Spearman, C., 9, 75, 83 
Special abilities, 52 
Specific factor, 84 
Speech, manner of, 54 
Speed, 80 
Spohn, H. E., 525 
Spontaneity, 586 
SRA Youth Inventory, 478 
Stainbrook, E., 527 
Standard error of measurement, 12, 
17 
Standard score, 45 
Standards, 56 
Stanford Achievement Tests, 384 
Stanford Revision of the Binet Scale 
(1916), 115 
criticisms of, 126 
reliability of, 118 
scoring method of, 121 
validity of, 115 


39 




Stanford Revision of the Binet-Simon 
Scale (1937), 129, 171, 192, 209, 
406 

evaluation of, 147 
reliability of, 132 
short scale of, 146 
validation of, 130 

Stanford Scientific Aptitude Test, 358 

Stein, M. I., 542 

Steiner, M. E., 518 

Stenquist, J. L., 317 

Stern, William, 111 

Stevenson, I., 428 

Stoddard, G. D., 64, 155 

Story completion test, 567 

Story telling test, 567 

Stratified sampling, 490 

Strength of grip, 311 

Stromberg Dexterity Test, 313 

Strong Vocational Interest Blank, 362 

Study of Values, 487 

Stuit, D. B., 352, 464, 482 

Stutsman, R., 235 

Subject age, 380 

Substitution test, 199 

Suci, G. J., 498 

Sullivan, E. T., 270 

Sullivan, P. L., 475 

Suttell, B. J., 448 

Swift, J. W., 520, 525 

Symonds, P. M., 396 

Symonds Picture-Study Test, 554 

Synonyms test, 276 

Szondi Test, 563 


Taylor, H. C., 374 
Teaching-aptitude tests, evaluation of, 
356 
Terman, E. L., 252 
Terman, L. M., 115, 125, 129, 276,
385, 409
Terman-McNemar Test of Mental
Ability, 275
Test
administering a, 57
content, 68
design, 68
factors in selecting a, 57
scores, interpretation of, 42
scoring, 57
standardization, 10


Tests 
of attitudes and values, 488 
for college level, 394 
of critical thinking, 377 
for high school level, 394 
Tetrad difference, 84 
Tetrad equation, 84 
Thematic Apperception Test, 533 
Thetford, W. N., 509 
Thompson, C. W., 151, 414 
Thompson, H., 221
Thompson Modification of the TAT, 
553 
Thomson, G. H., 85 
Thorndike, E. L., 9, 69 
Thorndike Intelligence Examination 
for High-School Graduates, 299 
Thorndike, R. L., 41 
Thornton, G. R., 519 
Thorpe, L. P., 469
Thurstone, L. L., 6, 9, 77, 85, 95, 279, 
288, 372, 484, 551 
Thurstone, T. G., 6, 77, 279, 288 
Tiegs, E. W., 270, 385, 469
Time interval, 24 
Time requirements, 57 
Tomkins, S. S., 536, 544
Toops, H. A., 289 
Traits 
definition of, 467 
measured by tests, 58 
sampling of, 4 
Travers, R. M. W., 498 
Travis-Johnston Projection Test, 551 
Triangle test, 199 
True differences, 11 
Tryon, C. M., 580 
Tryon, R. C., 86 
Tunks, L. K., 352 
Two-factor theory, 74, 83 
Two-figure board, 198 
Two-hand coordination test, 316 
Tyler, R. W., 398 


Validating objectives, 41 
Validity, 26, 58 
Binet’s criteria of, 107 
content, 399 
criteria of, 27, 29 
cross, 31 
face, 31
factorial, 30
functional, 26 
immediate, 29 
intermediate, 29
methods of calculating, 31 
operational, 26, 30 
ultimate, 29 
Values, tests of, 487 
van Lennep, D. J., 556 
Van Wagenen-Dvorak Diagnostic Ex- 
amination of Reading Abilities, 
389 
Van Wagenen, M. J., 227 
Variable, gross, 4 
Variability differences in the Stanford- 
Binet Scale (1937), 148 
Variance, 19 
analysis of, 12, 19 
Variation, coefficients of, 169 
Varon, E. J., 113 
Verbal analogies test, 289 
Verbal factor, 78 
Verbal group scales of mental ability, 
270 
Verbal materials, 150 
Verbal meaning, 280 
Verbal reasoning, 330 
Vernon, P. E., 86, 154, 487, 519
Vineland Social Maturity Scale, 461 
Vision, tests of, 307 
Visual acuity, 308 
Visual-motor skill, 71 
Vocabulary, 54, 88, 160, 164, 299 
Vocabulary index, 432 
Vocabulary scatter, 420 
Voelker, P. H., 327 


Walton, R. E., 558 
Watson-Glaser Test of Critical Think-
ing, 398
Webb, W. B., 175
Wechsler Adult Intelligence Scale
(1955), 187
Wechsler-Bellevue Scale, 156, 417
content of, 157
criticism and evaluation of, 180, 182
reliability of, 174
scoring of, 177
special features of, 181 
standardization of, 165 
validity of, 167 

Wechsler, D., 63, 157, 170, 195, 418,
475
Wechsler Intelligence Scale for Chil- 
dren, 188 
evaluation of, 195 
validity of, 191 

Weider, H. A., 475 

Weighted score, 179, 206 

Weigl-Goldstein-Scheerer Color Form
Sorting Test, 440

Wellman, B. L., 237 

Welsh, G. S., 475 

Wertheimer, M., 447 

Wesman, A. G., 329 

Wexler, M., 520 

Whipple, G. M., 102 

Willerman, B., 564 

Williams, M., 527 

Windle, C., 526 

Wing, H. D., 340 

Wing Standardized Tests of Musical
Intelligence, 337

Wishner, J., 527 

Wold, J. A., 444 

Wolf, R., 459 

Wolff, H. G., 475

Wolff, W., 569 

Wood, L., 211 

Word association tests, 547 

Word definitions, 53 

Word fluency, 280 

Word fluency factor, 78 

Work sample, 401 

World War I, 241, 313 

Wright, B. A., 567 

Wrightstone, J. W., 324, 397 

Yale Educational Aptitude Battery,
291
Yerkes, R. M., 115, 242 
Young, R. A., 525 


Zia Indians, 268 
Zyve, D. L., 358 




